Introduction to the Dataset

These data were taken from this paper in mSystems, investigating the link between soil physico-chemical properties, vegetation, and bacterial community structure.

Abstract:

Global deserts occupy one-third of the Earth’s surface and contribute significantly to organic carbon storage, a process at risk in dryland ecosystems that are highly vulnerable to climate-driven ecosystem degradation. The forces controlling desert ecosystem degradation rates are poorly understood, particularly with respect to the relevance of the arid-soil microbiome. Here we document correlations between increasing aridity and soil bacterial and archaeal microbiome composition along arid to hyperarid transects traversing the Atacama Desert, Chile. A meta-analysis reveals that Atacama soil microbiomes exhibit a gradient in composition, are distinct from a broad cross-section of nondesert soils, and yet are similar to three deserts from different continents. Community richness and diversity were significantly positively correlated with soil relative humidity (SoilRH). Phylogenetic composition was strongly correlated with SoilRH, temperature, and electrical conductivity. The strongest and most significant correlations between SoilRH and phylum relative abundance were observed for Acidobacteria, Proteobacteria, Planctomycetes, Verrucomicrobia, and Euryarchaeota (Spearman’s rank correlation [rs] = >0.81; false-discovery rate [q] = ≤0.005), characterized by 10- to 300-fold decreases in the relative abundance of each taxon. In addition, network analysis revealed a deterioration in the density of significant associations between taxa along the arid to hyperarid gradient, a pattern that may compromise the resilience of hyperarid communities because they lack properties associated with communities that are more integrated. In summary, results suggest that arid-soil microbiome stability is sensitive to aridity as demonstrated by decreased community connectivity associated with the transition from the arid class to the hyperarid class and the significant correlations observed between soilRH and both diversity and the relative abundances of key microbial phyla typically dominant in global soils.

Sample Sites and Transects in the Atacama. Note the hierarchical structure of the dataset - sites within transects.

SECTION 1 - QUALITY CONTROL OF RAW READS

1.1 Packages We’ll Need

Load the packages we’ll need and tell R to use multiple cores where available

    ####### Parallel computation 
        #options (mc.cores=parallel::detectCores ()) 
        
      ## Microbiome Tools
        library(phyloseq)
        library(vegan)
        library(microbiome)
        
        #install.packages("devtools")
        #devtools::install_github("jbisanz/qiime2R")
        library(qiime2R) #for importing QIIME artifacts into phyloseq
        
        library(DESeq2)
        library(labdsv)
        library(ALDEx2)

      ##Plotting     
        library(ggplot2)
        library(RColorBrewer) #pretty colour pallettes
        library(gplots)   #venn diagrams
        library(cowplot) #saving graphs
        library(pheatmap) #alternative heatmaps

      #Data Manipulation
        library(tidyverse)
        library(MicrobeDS)

      #Stats Models
        library(lme4)
        library(MuMIn)
        library(car)

1.1.2 Global Plotting Options

Quick access options for output graphics to make legends and axis labels larger / more legible

#Global Plot Options
            plotopts<- theme(axis.text=element_text(size=20),axis.title=element_text(size=22),strip.text=element_text(size=22),legend.title = element_text(size=20),legend.text = element_text(size=20)) 

1.2 Read in Data Files from Bioinformatics Analysis

Now that we’ve done that, we can read in the files we’ve just copied over. The package QIIME2R has a built in function to import the artefacts we built in QIIME into a format that can be read by the package phyloseq that we’ll use for a lot of our data manipulation and statistics. We tell the function which of the four files correspond to each of the ASV table, taxonomy of those ASVs, sample metadata and the phylogenetic tree of sequences.

### Build Phyloseq Object 
  physeq<-qza_to_phyloseq("table_16S.qza","rooted-tree.qza","taxonomy_16S_SKLEARN.qza", "sample-metadata.tsv")

Once we’ve made the object, we can ‘inspect it’ by typing its name:

  physeq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1099 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 1099 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1099 tips and 1095 internal nodes ]

This tells us what’s contained in each of the 4 elements of the phyloseq object. Crucially we can see that we have 54 samples, and have identified ~ 1100 microbial taxa (the otu_table() line). In addition, the sample_data() line tells us we have 22 sample variables in our metadata file. Note how the taxonomy, otu, and tree files all have the same number of taxa/tips.

Note that you can omit the tree from this step - PhyloSeq only needs the first three items to build a complete object. It just means you wont be able to calculate UNIFRAC statistics. Test it.

### Build Phyloseq Object 
  physeq_no_tree<-qza_to_phyloseq(features="table_16S.qza",taxonomy="taxonomy_16S_SKLEARN.qza", metadata="sample-metadata.tsv")
  physeq_no_tree
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1099 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 1099 taxa by 7 taxonomic ranks ]

1.3 Manipulating Phyloseq Objects

You can access each of the different elements of the phyloseq object using the commands as follows:

Sample Data:    sample_data(physeq)
Taxonomy:       tax_table(physeq)
ASV Matrix      otu_table(physeq)

Note how even though we’re using ASVs (inferred by DADA2 called in QIIME), phyloseq still refers to them as OTUs in the abundance matrix. Let’s test this out on the taxonomy element of our phyloseq object

head(tax_table(physeq),20)
## Taxonomy Table:     [20 taxa by 7 taxonomic ranks]:
##                                  Kingdom      Phylum            
## c2fc985801922717514f78cfae22c41f "d__Archaea" "Thermoplasmatota"
## b56103557f27fd289b1425d7c527e5ff "d__Archaea" "Thermoplasmatota"
## 744b943584ceaf276ae4ad181666daa9 "d__Archaea" "Thermoplasmatota"
## c9e32079e7c3e2a804a14ce56dfdcc05 "d__Archaea" "Thermoplasmatota"
## dfe7e74bb1e45ea73a95e1de0f813fb3 "d__Archaea" "Thermoplasmatota"
## f5f682d559c7a76fe77e9af2daf2ad4c "d__Archaea" "Thermoplasmatota"
## d8ffb44ef64f7589f69f8ade1a4f79d8 "d__Archaea" "Thermoplasmatota"
## f48560eb73e52267345d8ff3e31e1791 "d__Archaea" "Thermoplasmatota"
## cf400229268468ae928415e072520baf "d__Archaea" "Thermoplasmatota"
## 1d1837728dace68b39ad4eab8a996e66 "d__Archaea" "Thermoplasmatota"
## c2252c3545754ab07a79ca051971f458 "d__Archaea" "Nanoarchaeota"   
## f7f5d6bf9b30ab4087d623c77c5e6505 "d__Archaea" "Crenarchaeota"   
## 3fd9f5654f31b22598ba0028a8bdbe0d "d__Archaea" "Crenarchaeota"   
## a56b903521b8ab33a7878b35582803a8 "d__Archaea" "Crenarchaeota"   
## ad752f310a6e3164bdaad203b9c0ed38 "d__Archaea" "Crenarchaeota"   
## 470e5cda86f57a2e32f06633fc253fef "d__Archaea" "Crenarchaeota"   
## 7213f7039e323dbbbb58f197b116a0d0 "d__Archaea" "Crenarchaeota"   
## 7c64d5a7549d568c258e1a26a3cc0aa5 "d__Archaea" "Crenarchaeota"   
## ce857a8a5d2851f376e613e27ec4f5e1 "d__Archaea" "Crenarchaeota"   
## 0b533f8dfe4baaa9ffe85c19be904d23 "d__Archaea" "Crenarchaeota"   
##                                  Class             Order              
## c2fc985801922717514f78cfae22c41f "Thermoplasmata"  "Marine_Group_II"  
## b56103557f27fd289b1425d7c527e5ff "Thermoplasmata"  "Marine_Group_II"  
## 744b943584ceaf276ae4ad181666daa9 "Thermoplasmata"  "Marine_Group_II"  
## c9e32079e7c3e2a804a14ce56dfdcc05 "Thermoplasmata"  "Marine_Group_II"  
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Thermoplasmata"  "Marine_Group_II"  
## f5f682d559c7a76fe77e9af2daf2ad4c "Thermoplasmata"  "Marine_Group_II"  
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Thermoplasmata"  "Marine_Group_II"  
## f48560eb73e52267345d8ff3e31e1791 "Thermoplasmata"  "Marine_Group_II"  
## cf400229268468ae928415e072520baf "Thermoplasmata"  "Marine_Group_II"  
## 1d1837728dace68b39ad4eab8a996e66 "Thermoplasmata"  NA                 
## c2252c3545754ab07a79ca051971f458 "Nanoarchaeia"    "Woesearchaeales"  
## f7f5d6bf9b30ab4087d623c77c5e6505 "Nitrososphaeria" "Nitrososphaerales"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Nitrososphaeria" "Nitrososphaerales"
## a56b903521b8ab33a7878b35582803a8 "Nitrososphaeria" "Nitrososphaerales"
## ad752f310a6e3164bdaad203b9c0ed38 "Nitrososphaeria" "Nitrososphaerales"
## 470e5cda86f57a2e32f06633fc253fef "Nitrososphaeria" "Nitrososphaerales"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeria" "Nitrososphaerales"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeria" "Nitrososphaerales"
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeria" "Nitrososphaerales"
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeria" "Nitrososphaerales"
##                                  Family              
## c2fc985801922717514f78cfae22c41f "Marine_Group_II"   
## b56103557f27fd289b1425d7c527e5ff "Marine_Group_II"   
## 744b943584ceaf276ae4ad181666daa9 "Marine_Group_II"   
## c9e32079e7c3e2a804a14ce56dfdcc05 "Marine_Group_II"   
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Marine_Group_II"   
## f5f682d559c7a76fe77e9af2daf2ad4c "Marine_Group_II"   
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Marine_Group_II"   
## f48560eb73e52267345d8ff3e31e1791 "Marine_Group_II"   
## cf400229268468ae928415e072520baf "Marine_Group_II"   
## 1d1837728dace68b39ad4eab8a996e66 NA                  
## c2252c3545754ab07a79ca051971f458 "GW2011_GWC1_47_15" 
## f7f5d6bf9b30ab4087d623c77c5e6505 "Nitrososphaeraceae"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Nitrososphaeraceae"
## a56b903521b8ab33a7878b35582803a8 "Nitrososphaeraceae"
## ad752f310a6e3164bdaad203b9c0ed38 "Nitrososphaeraceae"
## 470e5cda86f57a2e32f06633fc253fef "Nitrososphaeraceae"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeraceae"
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeraceae"
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeraceae"
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeraceae"
##                                  Genus                      
## c2fc985801922717514f78cfae22c41f "Marine_Group_II"          
## b56103557f27fd289b1425d7c527e5ff "Marine_Group_II"          
## 744b943584ceaf276ae4ad181666daa9 "Marine_Group_II"          
## c9e32079e7c3e2a804a14ce56dfdcc05 "Marine_Group_II"          
## dfe7e74bb1e45ea73a95e1de0f813fb3 "Marine_Group_II"          
## f5f682d559c7a76fe77e9af2daf2ad4c "Marine_Group_II"          
## d8ffb44ef64f7589f69f8ade1a4f79d8 "Marine_Group_II"          
## f48560eb73e52267345d8ff3e31e1791 "Marine_Group_II"          
## cf400229268468ae928415e072520baf "Marine_Group_II"          
## 1d1837728dace68b39ad4eab8a996e66 NA                         
## c2252c3545754ab07a79ca051971f458 "GW2011_GWC1_47_15"        
## f7f5d6bf9b30ab4087d623c77c5e6505 "Candidatus_Nitrososphaera"
## 3fd9f5654f31b22598ba0028a8bdbe0d "Candidatus_Nitrososphaera"
## a56b903521b8ab33a7878b35582803a8 "Candidatus_Nitrososphaera"
## ad752f310a6e3164bdaad203b9c0ed38 "Candidatus_Nitrososphaera"
## 470e5cda86f57a2e32f06633fc253fef "Candidatus_Nitrososphaera"
## 7213f7039e323dbbbb58f197b116a0d0 "Nitrososphaeraceae"       
## 7c64d5a7549d568c258e1a26a3cc0aa5 "Nitrososphaeraceae"       
## ce857a8a5d2851f376e613e27ec4f5e1 "Nitrososphaeraceae"       
## 0b533f8dfe4baaa9ffe85c19be904d23 "Nitrososphaeraceae"       
##                                  Species                    
## c2fc985801922717514f78cfae22c41f "uncultured_archaeon"      
## b56103557f27fd289b1425d7c527e5ff "uncultured_archaeon"      
## 744b943584ceaf276ae4ad181666daa9 "uncultured_archaeon"      
## c9e32079e7c3e2a804a14ce56dfdcc05 "uncultured_archaeon"      
## dfe7e74bb1e45ea73a95e1de0f813fb3 "uncultured_archaeon"      
## f5f682d559c7a76fe77e9af2daf2ad4c NA                         
## d8ffb44ef64f7589f69f8ade1a4f79d8 "uncultured_archaeon"      
## f48560eb73e52267345d8ff3e31e1791 "uncultured_archaeon"      
## cf400229268468ae928415e072520baf "uncultured_haloarchaeon"  
## 1d1837728dace68b39ad4eab8a996e66 NA                         
## c2252c3545754ab07a79ca051971f458 "uncultured_archaeon"      
## f7f5d6bf9b30ab4087d623c77c5e6505 NA                         
## 3fd9f5654f31b22598ba0028a8bdbe0d "unidentified_archaeon"    
## a56b903521b8ab33a7878b35582803a8 "unidentified_archaeon"    
## ad752f310a6e3164bdaad203b9c0ed38 "unidentified_archaeon"    
## 470e5cda86f57a2e32f06633fc253fef "unidentified_archaeon"    
## 7213f7039e323dbbbb58f197b116a0d0 "uncultured_thaumarchaeote"
## 7c64d5a7549d568c258e1a26a3cc0aa5 NA                         
## ce857a8a5d2851f376e613e27ec4f5e1 NA                         
## 0b533f8dfe4baaa9ffe85c19be904d23 "metagenome"

We’ve asked for the top 20 records from our taxonomy file. You can see that the columns are named in a hierarchical fashion from Kingdom to Species. You will also notice that there’s variation in the precision with which we have been able to assign taxonomy. Some ASVs can only be assigned to the Bacterial Kingdom. One appear to be have been assigned to species level.

There seem to be a lot of Archaea in the top of this dataset. We can ask hw many of our ASVs belong to different phyla with a quick tabulate command

table(tax_table(physeq)[,1])
## 
##  d__Archaea d__Bacteria 
##          32        1067



We can ask what proportion of assignments we have made at each level with a quick line of code:

apply(tax_table(physeq)[,2:7],2,function(x){1-mean(is.na(x))})
##    Phylum     Class     Order    Family     Genus   Species 
## 0.9927207 0.9863512 0.9772520 0.9681529 0.9080983 0.4868062

Roughly 99% to Phylum level. The figures for Genus and Species level are a bit misleading, because a lot of species are assigned as ‘unindentified’ if you’ve used the SILVA database to assign taxonomy. Overall though, it’s not surprising that precision decreases with more fine-scale taxonomy. Remember we have only sequenced a couple of hundred bases of a highly conserved marker gene. That we are able to assign ANY sequences to species is remarkable.

1.3.2 Filtering Unwanted Taxa


Assuming we want to restrict our analyses to Bacteria, we can filter out taxa assigned to other Phyla like Archaea. FOr this we evaluate whether the 1st column of the taxonomy table - the Phylum - has the entry ‘Bacteria’, and if so to keep it.

physeq_bacteria<-prune_taxa(as.logical(tax_table(physeq)[,1]=="d__Bacteria"),physeq)
physeq_bacteria
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1067 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 1067 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1067 tips and 1063 internal nodes ]


We can see that we have lost some taxa here

ntaxa(physeq); ntaxa(physeq_bacteria)
## [1] 1099
## [1] 1067

1.4 Inspecting Sample Metadata


Now let’s look at the metadata file using the same code:

colnames(sample_data(physeq_bacteria))
##  [1] "barcode.sequence"                   "elevation"                         
##  [3] "extract.concen"                     "amplicon.concentration"            
##  [5] "extract.group.no"                   "transect.name"                     
##  [7] "site.name"                          "depth"                             
##  [9] "ph"                                 "toc"                               
## [11] "ec"                                 "average.soil.relative.humidity"    
## [13] "relative.humidity.soil.high"        "relative.humidity.soil.low"        
## [15] "percent.relative.humidity.soil.100" "average.soil.temperature"          
## [17] "temperature.soil.high"              "temperature.soil.low"              
## [19] "vegetation"                         "percentcover"


We can see that these metadata are a mix of those associated with sample preparation (BarcodeSequence, extraction concentration etc.), sampling design (Site and Transect Name), and biological variables like pH, humidity and vegetation characteristics.

1.5 Mock Communities and Negative Controls

For this tutorial, we haven’t been provided with any samples that represent positive controls - mock communities of artificially assembled bacteria and so of known composition/ species diversity that we can use to benchmark our assessment of per-sample diversity. You should always include positive controls in your experimental design.

Likewise there are no negative controls, which would allow us to what sequences have cropped up in the negatives controls to get a handle on the likely levels of background contamination. This could include microbes living in the salt-rich PCR mastermix, or the extraction kit, or background contamination in the lab where you processed your samples.

It’s worth noting here that how to deal with negative contamination is a contentious issue. One strategy is to simply remove any sequence that popped up in the negative controls from ALL samples. This is intuitive and has the advantage of being objective, but sometimes you can pick up extremely low abundance sequences that are highly abundant in other samples - suggesting some nefarious aerosol of DNA contaminated the negative from a real sample. Noah Fierer and co. have written a blog on sources of contamination and how to deal with them

Also be sure to check out the R package DECONTAM

1.6 Filtering Based on Abundance or Coverage Rules

We might want to subset the data to only reads that appear in our dataset with a given frequency. This can make our lives simpler by reducing computation time for downstream steps, and also reduce noise caused by low-abundance sequences with low coverage where can be less certain of their relative abundance. We could do this many ways, including only selecting:

1) Seqs that appear in at least X% of samples
2) Seqs that appear at a frequency of Y% in the entire dataset
3) Seqs that pass a certain raw reads threshold, i.e. must appear at least Z times in the dataset. 

Here, we’ll subset to only sequences with a least 100 reads:

    # Removing OTUs with fewer than 100 reads
      physeq100<-prune_taxa(taxa_sums(physeq_bacteria)>100,physeq_bacteria)
      
      #How Many Taxa Did We Lose?    
          ntaxa(physeq100); ntaxa(physeq_bacteria)
## [1] 123
## [1] 1067


Now let’s compute exactly how may taxa we lost.

          ntaxa(physeq_bacteria)- ntaxa(physeq100)
## [1] 944

1.7 Assessing Post-QC Sample Coverage and Library Sizes

Now that we’ve done that, we can check what our post-QC library sizes are. It’s a good idea to report these in manuscripts and thesis chapters. Here was ask what the minimum, maximum and mean library sizes per sample are.

<br>
 ############## POST QC LIBRARY SIZES
      
      ###############
      # Post-QC Library Stats
      ###############
      
      mean(sample_sums(physeq100))
## [1] 692.2037
      range(sample_sums(physeq100))
## [1]   17 1799

We might be interested in checking that our per-group coverage is roughly equal. Lack of equal coverage might occur if a particular subset of samples amplify poorly and are hard to pool at the same concentration as other samples. They might be older samples, or from a different sampling location, that systematically exhibit lower coverage. We’ve had reviewers ask for these checks before, so it might be wise to do them as a matter of course.

Let’s plot reads as a function of site ID

      #Make a data frame of read depths
        soil_reads<-data.frame(reads=sample_sums(physeq100))
        

      #Add on the sample ID
        soil_reads$Sample<-rownames(soil_reads)
        
      #Extract the Metadata from the phyloseq object 
        soil_meta<-data.frame(sample_data(physeq))
        soil_meta$Sample<-rownames(soil_meta)

      #Join on the Metadata
        soil_reads<-left_join(soil_reads,soil_meta,"Sample")
      
      #Some Boxplots of COverage by Population using ggplot2
        #library(ggplot2)
        ggplot(soil_reads,aes(x=site.name,y=reads)) + geom_boxplot(aes(fill=site.name)) + plotopts + theme(axis.text.x = element_text(angle = 45, hjust=1))

We need to interpret these with a bit of caution, as this a massively subsetted dataset with only a few samples per site, so estimates of among-site variability might be unreliable. Your own datasets will vary.



1.7.1 TASK

Adapt the code above to plot the range of sample reads for each of the two ‘Vegetation’ categories



1.8 Rarefaction Curves

Rarefaction curves allow us to assess whether we’ve likely discovered all the microbial ‘species’ / ASVs in a sample. The lower the true species diversity, and the higher the sampling depth (number of reads), the more likely we are to have saturated our species discovery curves. We can plot per-sample rates of species discovery using the ‘rarecurve’ function in vegan. By default this will plot a different line for each sample, so if you have hundreds of samples, these graphs can look very cluttered!

rarecurve(t(otu_table(physeq_bacteria)), step=50, cex=0.5)

1.9 Rarefying Samples

Ok - now we’re ready to do some calculations. But first, we need to remove any artefacts caused by uneven sampling depth across samples, using the command rarefy_even_depth.

This command will automatically rarefy to the lowest-coverage sample sampling depth if you do not specify a sampling depth. This means you won’t lose any samples, but could potentially throw away lots of data if there’s a single library that amplified poorly.

What is super important is that you specify a random number seed. This is because the rarefying step is based on a random subsample of each library, and providing a random number seed means your analysis will be completely reproducible (rather than differing every time because of sampling error).

Note that rarefying libraries is quite a contentious issue. See this paper here for an explanation.

     ####### Rarefying Samples
        
        #remind ourselves what the minimum coverage is 
          min(sample_sums(physeq))
## [1] 27

This is because there are some samples here that didn’t sequence very well at all - or lost a lot of reads in the QC steps in the bioinformatic workflow - or both. We can query our data and ask which samples have fewer than 500 reads

names(sample_sums(physeq100))[which(sample_sums(physeq100)<500)]
##  [1] "BAQ2420.1.1" "BAQ2420.1.3" "BAQ2420.3"   "BAQ2462.3"   "BAQ2687.1"  
##  [6] "BAQ2687.3"   "BAQ2838.1"   "BAQ2838.2"   "BAQ2838.3"   "BAQ4166.1.1"
## [11] "YUN3259.1.1" "YUN3259.1.3" "YUN3259.2"   "YUN3346.2"   "YUN3533.1.2"
## [16] "YUN3856.2"


And how many there were in total

length(sample_sums(physeq_bacteria)[which(sample_sums(physeq_bacteria)<500)])
## [1] 5

So if we choose 500 as our sampling depth for rarefying, we will lose these samples as they don’t satisfy our minimum depth criterion. We can do that now. It’s very important that we we set a random number seed so that the workflow is reproducible:

    # Rarefy to a custom minimum library size, set random number see for reproducibility
        ps_rare<-rarefy_even_depth(physeq_bacteria,500,rngseed = 150517)
## `set.seed(150517)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(150517); .Random.seed` for the full vector
## ...
## 5 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## BAQ2838.2BAQ2838.3YUN3259.1.1YUN3259.1.3YUN3346.2
## ...
## 53OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...

Note how we get helpful warnings about which samples we lost, and also which sequences we lost because they were found only in those samples. Again this demonstrates how phyloseq operates on the whole phyloseq object and how all the tables within an object are linked.


SECTION 2: COMMUNITY STATISTICS - ALPHA DIVERSITY

2.1 Manipulating Phyloseq Objects

Much like using ‘subset’ on dataframes in R, phyloseq has some built in functions for manipulating phyloseq objects to select only certain samples, or sequences (taxa/ASVs) of interest.

Here we’ll make use of the prune_samples command that allows us to subset our phyloseq object to only certain sample of interest meeting a certain condition.

The general syntax is: prune_samples(condition to select, phyloseq object to use)


Subset Our Phyloseq Object to Only Samples Containing Vegetation Cover

#Prune
  veg_yes<-prune_samples(sample_data(ps_rare)$vegetation =="yes",ps_rare)


Now we can check we’ve done what we think we’ve done.

#Inspect
  veg_yes
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1014 taxa and 32 samples ]
## sample_data() Sample Data:       [ 32 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 1014 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1014 tips and 1010 internal nodes ]
  table(sample_data(veg_yes)$vegetation)
## 
## yes 
##  32


2.2.1 TASK

Adapt the code above to subset the data to only those records where Average Relative Humidity is less than 80


2.2 Making a Venn Diagram of Shared Sequences by Group (Population)

Now that we have a quality controlled, cleaned and rarefied dataset, we can do some calculations and visualizations. Let’s start by making a venn diagram illustrating sequences that are unique to each sampling group for a variable of interest, and those that are shared, The latter is often used as evidence of a ‘core microbiome’, if such a thing exists.

Here we will be comparing between soil that has vegetation cover and that without, but your groups could be anything - sampling times / seasons, sites etc.

Now onto our Venn diagram

###################   SHARED AND UNIQUE SEQUENCES BY SITE
        #library(gplots) #required for Venn diagram. Online Venn diagram tools like Venny are also available. 
        
        #Empty lists - we have two groups so we need two empty lists
          venn.list<-rep(list(NA),2)
          poplist<-rep(list(NA),2)
          pops<-as.character(unique(sample_data(ps_rare)$vegetation))
          
       #Populate List with the Names of Groups we Want to Compare
        poplist[[1]]<-pops[1]
        poplist[[2]]<-pops[2]

      #Name the Groups in the Vennlist
        names(venn.list)<-pops
        
      #Loop over each population, subset the phyloseq object to just that population and work out which SVs are in that population, store those names in the lisr
        for(k in 1:2){
          
        #Subset Phyloseq Object to One Pop at a time  
          phy.sub<-prune_samples(sample_data(ps_rare)$vegetation %in% poplist[[k]],ps_rare)
          
        #Calculate which ASVs are present (non-zero abundance)  
          phy.sub.keep<-apply(otu_table(phy.sub),1,function(x){sum(x)>0})
          
        #Prune down to Inly those ASVs  
          phy.sub.sub<-prune_taxa(phy.sub.keep,phy.sub)
          
        #Extract the names of these ASVs and store them in the appropriate list slot  
          venn.list[[k]]<-rownames(otu_table(phy.sub.sub))
        } #loop end
        
        
      #Draw Venn Diagram, automatically calculate overlap in the list and unique variants
        venn(venn.list)

This is cool! There are ~ 100 shared ASVs across both vegetation types, but clearly lots of unique OTUs that seem to only occur within each vegetation type.

Note that some of these packages like gplots only allow a maximum number of groups. Gplots allows up to 5, so if you have more to compare then you might have to use another package. Note that Venn diagrams with loads of categories are often horrid to look at and require a lot of effort to interpret. There are also some cool web based tools like BioVenn and Venny that let you paste in sequences in each population and specify colours of groups. You can still use the code above to identify per-group ASVs to then paste into the appropriate boxes on these web platforms.
For a discussion on the pros and cons of Venn diagrams for microbiome data see this excellent paper here


2.2.1 Loop-Free Venn Diagram Creation

We can make the above code without the complex loop, if that helps to see what the code is doing. Hopefully the answer here will be the same as the one above!

###### LOOP FREE VENN DIAGRAM CREATION
## repeat for each condition you want to look at 

      #Subset Based On Condition 
        veg_yes<-prune_samples(sample_data(ps_rare)$vegetation =="yes",ps_rare)
      #Work out which ASVs are present (>0 abundance) 
        veg_yes_keep<-apply(otu_table(veg_yes),1,function(x){sum(x)>0})
      #Subset to Only Those ASVS  
        veg_yes_sub<-prune_taxa(veg_yes_keep,veg_yes)
      #Strip Out the Rownames of the Subsetted Object  
        veg_yes_asvs<-rownames(otu_table(veg_yes_sub))
        head(veg_yes_asvs)
## [1] "99fa718995b3129c258bd12bdabb4bec" "2dcce6012bdc7a9cb15900ee79beb2ed"
## [3] "8133d2b0b8fcb93ae67ebaaec9667e7b" "996b8c93a5184a88d3ec0d3965fe31b2"
## [5] "d183373f95d7758e5883ac24de921410" "3a51360d9283ba62bef12229ba447c63"
      #Same as above for No    
        veg_no<-prune_samples(sample_data(ps_rare)$vegetation =="no",ps_rare)
        veg_no_keep<-apply(otu_table(veg_no),1,function(x){sum(x)>0})
        veg_no_sub<-prune_taxa(veg_no_keep,veg_no)
        veg_no_asvs<-rownames(otu_table(veg_no_sub))
        
      #Make a List for the Venn Diagram 
        veg_list<-list(yes=veg_yes_asvs,no=veg_no_asvs)
        
      #Make the Venn diagram 
        library(gplots)
        venn(veg_list)

2.3 Calculating Alpha Diversity

Alpha diversity generally corresponds to overall species (sequence) richness in a sample. Some alpha diversity metrics account for overall richness but also how those sequences are distributed among the samples (evenness). It’s important to note that these metrics will all be correlated because they all measure similar things. Some papers will report tests on multiple alpha diversity values, but this arguably exposes them to multiple testing issues - that is increased risk of finding a false positive by asking the same statistical question over and over.

Calculate Alpha Diversity for Samples


    ### Estimate lots of alpha diversity measures (note Chao1 should be ignored with DADA2 as Chao1 is based on frequency of singletons, but DADA2 doesnt allow singletons)    
      soil_richness<-estimate_richness(ps_rare,measures=c("Observed","Shannon","InvSimpson"))
        head(soil_richness)
##             Observed  Shannon InvSimpson
## BAQ2420.1.1       23 2.756977   12.63903
## BAQ2420.1.2       26 2.921896   15.31863
## BAQ2420.1.3       36 3.248181   20.08677
## BAQ2420.2         33 3.319443   24.39500
## BAQ2420.3         36 3.343809   24.84596
## BAQ2462.1         34 3.093297   14.04968


Proof That Alpha Diversity Metrics are Correlated

     with(soil_richness,plot(Shannon,InvSimpson))


We can inspect the spread of our richness measures here:

subset(soil_richness,Observed<5)
## [1] Observed   Shannon    InvSimpson
## <0 rows> (or 0-length row.names)
ggplot(soil_richness,aes(x=Observed)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.3.2 Alpha Diversity: Rarefy Once or Many Times?

Rarefying is a random subsample of our reads to equalise read depth, and from these we can calculate traits of interest like ASV-richness. But are we worried that the answer will change depending on that single random subsample? Can we fix this by repeating the rarefying multiple times to calculate our traits of interest? Yes we can!

Bad news is we will be using loops to do this. For each iteration, we will: i) rarefy, ii) calculate and store a **per-sample* alpha diversity value iii) store the values to perform calculations on later

## How Many Times Should We Rarefy?
  n_rarefy_sims<-100

#What Should Our Rarefying Threshold Be?
  rarefy_depth<- 500

#How Many Samples Are We Expecting? 
  nsamples_permute<-length(sample_sums(physeq_bacteria)[which(sample_sums(physeq_bacteria)>=rarefy_depth)])

#An empty matrix to store our data
  richvals_matrix<-matrix(NA,ncol=n_rarefy_sims,nrow=nsamples_permute)

  ##### RUN THE LOOP
    for (k in 1:n_rarefy_sims){
      
      #New Rarefying Step - no random number seed or all results will be the same
        ps_rare_temp<-rarefy_even_depth(physeq_bacteria,rarefy_depth)
        
      #Calculate a Trait - we'll use Shannon Diversity
        richvals_matrix[,k]<-as.matrix(estimate_richness(ps_rare_temp,measures="Shannon"))[,1]
    }

2.3.3 Visualising Per-Sample Variation in Estimates from Rarefying

We’ve now made a matrix where each row represents error in the estimation of a trait of interest due to rarefying. It might be useful to start by visualizing this per-sample uncertainty

#Name the Samples 
  rownames(richvals_matrix)<- sample_names(rarefy_even_depth(physeq_bacteria,rarefy_depth))
## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`
## ...
## 5 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## BAQ2838.2BAQ2838.3YUN3259.1.1YUN3259.1.3YUN3346.2
## ...
## 62OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...
#Format FOr Plotting 
  t(richvals_matrix) %>% as.data.frame() %>% pivot_longer(everything(),names_to = "sample",values_to = "shannon") -> rarefy_permute_plotdata
  
##Plot
  ggplot(rarefy_permute_plotdata,aes(x=sample,y=shannon)) + geom_violin(aes(fill=sample)) + theme(legend.position="none") + plotopts + theme(axis.text.x = element_text(angle = 45, hjust=1,size=10))

Hopefully you can see from this that variation AMONG SAMPLES (mostly biological variation) is far greater than variation WITHIN SAMPLES introduced by our choice of rarefying threshold (statistical variation).

We can also plot average value produced from these permutations against the single-step values from above

plot(apply(richvals_matrix,1,mean), as.numeric(estimate_richness(ps_rare,measures="Shannon")[,1]))
abline(0,1)

This is encouraging, though you might notice some deviations from the expected 1:1 line

2.3.4 TASK

Repeat the above exercise but change the rarefying threshold to your choice of value. Smaller values will likely work better for this dataset

2.4 Plots of Alpha Diversity by Groups of Interest

Phyloseq has built in plotting functions for alpha diversity, where you can choose the metric you wish to plot. Note you don’t have to have used ‘estimate_richness’ first for this to work. We can use the ‘x=’ argument to specify how to group the alpha diverstiy metrics. First we can just plot the Shannon diversity for all samples separately:

    ## Make a plot of our chosen richness estimator by individual sample   
        plot_richness(ps_rare,x="samples",measures=c("Shannon"))  

But this isn’t very useful. More interesting would be to see if alpha diversity varied by group, in this case good old Vegetation

    ## Make a plot of our chosen richness estimator by population   
        plot_richness(ps_rare,x="vegetation",measures=c("Shannon"))  


2.4.1 TASK

Plot the alpha diversity metric of your choice by site and transect.

2.5 Plots of Alpha Diversity with Other Variables

But we might want to plot the relationship between alpha diversity and other variables, or even do the same but controlling for a third variable. Here we plot the relationship between alpha diversity and Elevation.
Because phyloseq calls ggplot, you can add ggplot customisation layers to the plot that specify aesthetics like plotting character, size and fill, and axis text sizes

        plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,fill="white") + theme_classic()  + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))

Definitely looks like there’s something going on here, which we’ll explore formally later.
We might want to do the same, but manipulate the fill colour of the points depending on a third variable - in this case Vegetation status

        plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,aes(fill=vegetation)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))

Note that we can also achieve the above manually by extracting our richness estimates created with the ‘estimate_richness’ function and tacking on our metadata, followed by simular plotting arguments. First we add a column called ‘Sample’ to the richness dataframe to be used as an index for joining the two dataframes.

      #Add Richness Onto Our Metadata
         soil_richness$Sample<-rownames(soil_richness)
         soil_richness<-left_join(soil_richness,soil_meta,"Sample")
     
          #Plot the correlation between heterozygosity and richness (shannon diversity)
              ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,fill="white") + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))


2.6 Correlation Tests with Alpha Diversity


Looks like there’s something going on. How about we ask if there’s any correlation between alpha diversity and elevation

              with(soil_richness,cor.test(elevation,Shannon))
## 
##  Pearson's product-moment correlation
## 
## data:  elevation and Shannon
## t = 4.4477, df = 47, p-value = 5.291e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3106001 0.7158941
## sample estimates:
##       cor 
## 0.5442611

2.7 More Custom Plotting

Looks convincing, but what we haven’t done here is account for population effects - Site and Transect. So let’s make the same plot again but identify which Transects individuals come from. Code versions below for plotting directly from phyloseq and from our metadata dataframe.

  ##### USING OUR COMBINED DATAFRAME

          # Same as above, but identify which population the individuals come from 
              pop1<-ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,aes(fill=site.name)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
              pop1

          # Same as above, but 'facet' each population to a separate panel
              pop1 + facet_wrap(.~transect.name)

   ##### USING PHYLOSEQ
        pop2<-plot_richness(ps_rare,x="elevation",measures=c("Shannon")) + geom_point(size=5,pch=21,aes(fill=site.name)) + theme_bw() + labs(x="Elevation (m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12))
        pop2

        # Same as above, but 'facet' each population to a separate panel
            pop2 + facet_grid(.~transect.name)

If we don’t like the default colours R gives us, we can choose our own. R Color Brewer is brilliant for this and works directly within its own ggplot call. We can even just use the ‘pop1’ object we’ve already created to change the colours

       #Plot the correlation between heterozygosity and richness (shannon diversity)
             elev_shannon_default<- ggplot(soil_richness,aes(x=elevation,y=Shannon)) + geom_point(size=5,pch=21,aes(fill=transect.name)) + theme_bw() + labs(x="Elevation(m)",y="Alpha Diversity (Shannon)") + theme(axis.text=element_text(size=12),axis.title = element_text(size=12)) 
elev_shannon_default

 ######## Adding Some Custom Plotting Colours 
elev_shannon_default + scale_fill_brewer(palette = "Set3")

2.7.1 TASK

Have a look at all the R Color Brewer palettes and make the graph look how you want it to. You can replace the palette names directly above e.g.“Set3” etc

display.brewer.all()

2.8 Saving Graphs in Custom Output Formats

The package ‘cowplot’ has some flexible implementations for saving ggplot-style plots at publication quality and in a range of formats. All you need to do is have all your layers saved as one object for export. Let’s say the following is our preferred graph:

       pop1 + scale_fill_brewer(palette="Set1")     

We can save this colour layer as a new object and export using cowplot’s implementation of ‘ggsave2’. The general syntax is:

ggsave2(file name of output file.FORMAT, saved graph object)

The nice thing is we might want to save it as both a pdf and a tiff file, and simply changing the file format in the filename will do so. If a journal asks for a minimum resolution / dpi, you can specify that too. (see ?ggsave2)

#All Layers Together
pop_custom_colours<-pop1 + scale_fill_brewer(palette="Set1")  

    #PDF Version
        ggsave2('Shannon Diversity by Transect.pdf',pop_custom_colours)
## Saving 7 x 5 in image
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
    #TIFF version    
        ggsave2('Shannon Diversity by Transect.tiff',pop_custom_colours)
## Saving 7 x 5 in image
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
    #PDF Version With Custom Size in Inches
                ggsave2('Shannon Diversity by Transect LARGE.pdf',pop_custom_colours,width=12,height=10,units="in")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

SECTION 3: COMMUNITY STATISTICS - BETA DIVERSITY

Next we can look at metrics of community structure - differences among samples in the distribution and relative abundance of our microbial taxa

3.1 Heatmaps of Community Structure

First, we can visualise these differences using a heatmap, which will show us differential abundances of each ASV within and across samples. First we subset the data to only the 50 most abundant ASVs to make the plot simpler to interpret.

  ####################  HEATMAP (baked into phyloseq)
        
      #Subset to Most 50 abundant OTUs
        ps_rare_top50 <- prune_taxa(names(sort(taxa_sums(ps_rare),TRUE)[1:50]), ps_rare)

Next, we plot the heatmap By default, the heatmap will cluster samples along the x axis based on similarity of total community structure. Here we’d hope that samples cluster by population.

      #Plot Heatmap with X axis ordered Inter-Sample Distance    
        plot_heatmap(ps_rare_top50,"NMDS",distance = "bray")    
## Warning: Transformation introduced infinite values in discrete y-axis

They do, but there is some ‘noise’ here. Still, this is convincing evidence that samples within a Transect are more similar to one another in terms of microbial community structure than to samples in different Transects. It also makes it easy to spot the samples seemingly dominated by just one sequence.
If we don’t want that behaviour, we can manually override the clustering and tell it to group samples by a variable of interest, such as river

      #Plot Heatmap with X axis ordered by population   
        plot_heatmap(ps_rare_top50,"NMDS",distance = "bray",sample.label="site.name",sample.order = "site.name")    
## Warning: Transformation introduced infinite values in discrete y-axis

3.1.1 Other Heatmap Packages and Functions

There are plenty of heatmap packages in R, and one I quite like is the pheatmap package. It allows you to annotate the heatmaps with useful metadata, like sampling site etc.

Here we’ll plot a heatmap with pheatmap, and annotate samples based on whether there was vegetation on that sample.Note we’re not clustering by column (sample) here, just letting them be plotted in site-order as they appear in the data frame.

#library(pheatmap)  

#Generate Metadata for Plotting Colours
  veg_data<-data.frame(Vegetation=sample_data(ps_rare_top50)$vegetation)
  rownames(veg_data)<-rownames(sample_data(ps_rare_top50))


#Plot Heatmap 
  pheatmap(otu_table(ps_rare_top50),cluster_cols = FALSE,scale="row",annotation_col = veg_data)

    pheatmap(otu_table(ps_rare_top50),cluster_cols = TRUE,scale="row",annotation_col = veg_data)

3.1.2 TASK

Modify the Code above to use R Color Brewer to change the plotting colours.

3.2 Barplots of Community Structure

We can summarise our microbial communities using compositional barplots, which provide a quick and accessible way of visualising differences in taxonomy at different hierarchies.

The first step here is to remove samples in our phyloseq object with low reads, otherwise this will break our averaging

#Filter Samples With No Data
  physeq_subset<-prune_samples(sample_sums(physeq_bacteria)>0,physeq_bacteria)

Next we tell phyloseq to aggregate reads at the taxonomic level of interest - here Phylum - and to transform the data to relative abundance (fraction of reads, rather than count of reads)

#Aggregate To Phylum Level and Transform to Relative Abundance  
    physeq_phylum <- physeq_subset %>%
        aggregate_taxa(level = "Phylum") %>%  
      microbiome::transform(transform = "compositional")

Then we can plot the composition for each sample, order by the abundance of the Phylum Firmicutes (you can change this to any phylum present)

#Plot Composition By Sample ID    
physeq_sample_plot <- physeq_phylum %>%
  plot_composition(sample.sort = "Firmicutes")

  physeq_sample_plot

Alternatively, we can group our data by sample metadata of interest - here we can bin all samples from the same transect together.

#Plot Composition Grouped By Sample Metadata
  physeq_grouped_plot <- physeq_phylum %>%
    plot_composition(sample.sort = "Firmicutes", average_by = "transect.name")
  physeq_grouped_plot

These graphs can get messy sometimes, so it’s often useful to subset the data to some of the most abundant members of the taxonomy of interest. Here we’ll calculate the top 5 most abundant phyla in our data, subset the phyloseq object to those data, and remake our graph.

#### Run Again But With Top 'N' Phyla 
  
#What Are the Names of the most abundant phyla?  
  physeq_phylumcollapse<- physeq_subset %>% aggregate_taxa(level="Phylum")
  physeq_top5phyla = names(sort(taxa_sums(physeq_phylumcollapse), TRUE)[1:5])

#Subset the phyloseq object to those phyla   
  physeq_top5phylum_filter<-subset_taxa(physeq_subset,Phylum %in% physeq_top5phyla)
  
#Remake Our Graph  but with grouping by VEGETATION

  physeq_top5phylum_plot <- physeq_top5phylum_filter %>%
    aggregate_taxa(level = "Phylum") %>%  
    microbiome::transform(transform = "compositional") %>%
    plot_composition(sample.sort = "Firmicutes", average_by = "vegetation")
  physeq_top5phylum_plot

We can also add custom colours to these graphs to make the distinctions between categories easier

physeq_top5phylum_plot + scale_fill_brewer(palette="Set3")

3.2.1 TASK

Make a compositional barplot where taxa are aggregated at the FAMILY Level, split by site

Make a compositional barplot where samples are binned into SITES, not transects.

3.3 Ordination

It can be useful to visualize microbial community structure rather than rely on a matrix of numbers. There are many ways to do so, all with jazzy acronyms, including Non-Metric Multidimensional Scaling (NMDS), Principal Coordinates Analysis (PCoA), Constrained Correspondence Analysis (CCA), Detrended Correspondence Analysis (DCA) etc. They each make different assumptions about the data, but they share the same principle of reducing variation in the relative abundances hundreds of microbial taxa into 2 or 3 axes that can then be plotted/visualised, or statistically tested.

For an excellent guide to the pros and cons of all of these methods, see this paper by Paliy & Shankar

3.3.1 Ordination I - within Phyloseq

Here we’ll use the built in phyloseq functions to perform NMDS ordination on Bray-Curtis distances among samples. We specify ‘k=2’ to state that the variation should be condensed into 2 axes.


As a general rule, the ‘stress’ value of an NMDS ordination should be below 15% / 0.15, and ideally below 10%. Stress is a measure of how ‘easy’ it was to reduce the variation in your microbial abundance matrix into the number of axes you specify. If the stress value is too large, or the model doesn’t converge, then try setting k to 3. If you do this, make sure to state in your methods of your paper that the ordination was done across 3 axes.

  #################### ORDINATION  WITHIN PHYLOSEQ  
        
    ### Ordination using built in functions in phyoseq (calls vegan)    
        ord.nmds.bray <- ordinate(ps_rare, method="NMDS",k=3, distance="bray",trymax=50)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1311407 
## Run 1 stress 0.130744 
## ... New best solution
## ... Procrustes: rmse 0.08235512  max resid 0.2411586 
## Run 2 stress 0.1297148 
## ... New best solution
## ... Procrustes: rmse 0.04422977  max resid 0.2710738 
## Run 3 stress 0.1255234 
## ... New best solution
## ... Procrustes: rmse 0.0919183  max resid 0.2401458 
## Run 4 stress 0.1288516 
## Run 5 stress 0.1266376 
## Run 6 stress 0.1275053 
## Run 7 stress 0.1266969 
## Run 8 stress 0.1264839 
## Run 9 stress 0.1256807 
## ... Procrustes: rmse 0.01633202  max resid 0.06366843 
## Run 10 stress 0.1280006 
## Run 11 stress 0.1265218 
## Run 12 stress 0.1281518 
## Run 13 stress 0.126604 
## Run 14 stress 0.1270444 
## Run 15 stress 0.1306928 
## Run 16 stress 0.1290179 
## Run 17 stress 0.12919 
## Run 18 stress 0.1290175 
## Run 19 stress 0.1254496 
## ... New best solution
## ... Procrustes: rmse 0.005840751  max resid 0.02249395 
## Run 20 stress 0.1293103 
## Run 21 stress 0.127207 
## Run 22 stress 0.1264251 
## Run 23 stress 0.1289484 
## Run 24 stress 0.1269614 
## Run 25 stress 0.1266924 
## Run 26 stress 0.1286795 
## Run 27 stress 0.1254753 
## ... Procrustes: rmse 0.003605075  max resid 0.01789077 
## Run 28 stress 0.1264333 
## Run 29 stress 0.1293387 
## Run 30 stress 0.1254494 
## ... New best solution
## ... Procrustes: rmse 0.0006687956  max resid 0.002601191 
## ... Similar to previous best
## *** Solution reached
        #scores(ord.nmds.bray)
        
      #Make Ordination Plot and Store it  
        ord1<-plot_ordination(ps_rare, ord.nmds.bray, color="transect.name", title="Bray NMDS")
        
      #Plot with Ellipses assuming normal distribution  
        ord1 #+ stat_ellipse(type="norm") + theme_bw()

      #Same But with Filled Ellipses (note that 'alpha' changes how transluscent the fill is)
        ord1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()

3.3.2 TASK

Modify the Code above to use R Color Brewer to change the plotting colours.

3.3.3 Ordination II - using ‘vegan’

Now we’ll do the same using vegan’s equivalent functions. To do so we need to extract the relevant dataframes from our phyloseq object.

#################### ORDINATION  USING VEGAN   
        
        
     #### Vegan Function to Convert Phyloseq into Something Vegan can call directly
        vegan_otu <- function(physeq) {
          OTU <- otu_table(physeq)
          if (taxa_are_rows(OTU)) {
            OTU <- t(OTU)
          }
          return(as(OTU, "matrix"))
        }
        
        #Load Libraries
        #library(vegan)
        #library(RColorBrewer)
        
        #Convert OTU table to abundance matrix
        soil.v<-vegan_otu(ps_rare)
        
        #Convert Sample Data to     
        soil.s<-as(sample_data(ps_rare),"data.frame")
        
      ###### NMDS Ordination
        soil.nmds<-metaMDS(soil.v,k=3,distance="bray",trymax = 50)       
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1311407 
## Run 1 stress 0.1287611 
## ... New best solution
## ... Procrustes: rmse 0.08871537  max resid 0.2775393 
## Run 2 stress 0.1299215 
## Run 3 stress 0.1270068 
## ... New best solution
## ... Procrustes: rmse 0.08542  max resid 0.2763549 
## Run 4 stress 0.129896 
## Run 5 stress 0.1255131 
## ... New best solution
## ... Procrustes: rmse 0.04094959  max resid 0.2534407 
## Run 6 stress 0.1264248 
## Run 7 stress 0.1270382 
## Run 8 stress 0.1282543 
## Run 9 stress 0.1292836 
## Run 10 stress 0.1293186 
## Run 11 stress 0.1299762 
## Run 12 stress 0.1254493 
## ... New best solution
## ... Procrustes: rmse 0.006136808  max resid 0.02140133 
## Run 13 stress 0.1292856 
## Run 14 stress 0.1292054 
## Run 15 stress 0.1291509 
## Run 16 stress 0.1267317 
## Run 17 stress 0.1308613 
## Run 18 stress 0.1306514 
## Run 19 stress 0.1274428 
## Run 20 stress 0.1264059 
## Run 21 stress 0.1265212 
## Run 22 stress 0.1292011 
## Run 23 stress 0.1284072 
## Run 24 stress 0.1301848 
## Run 25 stress 0.128199 
## Run 26 stress 0.1254496 
## ... Procrustes: rmse 0.0007108842  max resid 0.002519635 
## ... Similar to previous best
## *** Solution reached
        stressplot(soil.nmds)

        #scores(soil.nmds)


3.3.4 NMDS Plot With Vegan

    ##################### NMDS Plot using VEGAN
       

      #Select Some Colours from RColorBrewer  
        ordination_cols<-brewer.pal(5,"Set2")
        cols<-data.frame(pop=unique(soil.s$transect.name),col=ordination_cols[1:length(unique(soil.s$transect.name))])
        
      #Expand so that each sample in the database has a colour value  
        cols.expand<-as.character(cols[,2][match(soil.s$transect.name,cols[,1])])
        cols<-cols[order(cols[,1]),]  
        
      #Plot Axis Bounds
        plot(soil.nmds,display="sites",type="n",las=1,ylab="",xaxt="n",yaxt="n",xlab="")
        
      #Add On A Spider linking each point to its population centroid  
        ordispider(soil.nmds,soil.s$transect.name,col="blue",label=F)
        
      #Add the data points  
        with(soil.s,points(soil.nmds,display="sites",pch=21,bg=cols.expand,cex=2))
        
      #Add an ellipse 
          #'ehull' is ellipsoid hull that encloses all points within a group
          #'#?ordiellipse for more options under option 'kind'
        ordiellipse(soil.nmds,soil.s$transect.name,col="blue",label=F,kind="ehull")
        
      #Add Axis Labels  
        mtext(c('NMDS 1', 'NMDS 2'), side=c(1,2), adj=1, cex=1.5,font=2, line=c(1, 0))

        with(soil.s,legend("bottomright",legend=cols[,1],pch=21,pt.bg=as.character(cols[,2]),bty="n",pt.cex=2))          

        ### Note You Can play with the axis limits to frame the plot more accurately


3.4 Statistical Testing Using PERMANOVA

PERMANOVA is a randomisation procedure that will test for the effects of predictors of interest in driving differences in beta diversity / community structure. It runs in vegan, so we need the files we converted for vegan. ‘soil.v’ is our ASV abundance matrix, and the ‘data’ argument needs our sample metadata.

The nice thing about permanova is that it provides r2 values for effects as well as p values, so you can get an idea of % of explained variance and ‘importance’ of effects.

    ######################### TESTING FOR DIFFERENCES WITH PERMANOVA 
        soil.adonis<-adonis(soil.v ~ factor(transect.name) + vegetation ,data=soil.s,permutations=10000,method="bray")
        soil.adonis #pop rsq = ~59% p <0.0001 
## 
## Call:
## adonis(formula = soil.v ~ factor(transect.name) + vegetation,      data = soil.s, permutations = 10000, method = "bray") 
## 
## Permutation: free
## Number of permutations: 10000
## 
## Terms added sequentially (first to last)
## 
##                       Df SumsOfSqs MeanSqs F.Model      R2    Pr(>F)    
## factor(transect.name)  1    0.9159 0.91588  2.1552 0.04172 9.999e-05 ***
## vegetation             1    1.4904 1.49038  3.5071 0.06789 9.999e-05 ***
## Residuals             46   19.5480 0.42496         0.89040              
## Total                 48   21.9543                 1.00000              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      soil.adonis_site<-adonis(soil.v ~ factor(site.name) + factor(transect.name) + vegetation ,data=soil.s,permutations=10000,method="bray")


Transect ID only explains ~4% of our variation in microbial community structure.



3.4.1 TASK

Fit a PERMANOVA model investigating the effect of Extraction Batch on microbial community structure

3.5 UNIFRAC Distances in Phyloseq

An alternative to Bray-Curtis distances among samples is to use the Unifrac distance. Unifrac takes into account the phylogenetic relatedness amongst community members using a sequence tree built from your sequences. ‘Unweighted’ unifrac simply uses genetic distance, whereas ‘Weighted’ unifrac takes into account the relative abundance of each microbial taxon using the abundance table

3.5.1 Unweighted Unifrac

    ######### UNIFRAC ORDINATION PLOTS
            
          ## Unweighted Unifrac  
            ### Ordination using built in functions in phyoseq (calls vegan)    
           ord_nmds_unifrac_unweighted <- ordinate(ps_rare, method="PCoA", distance="unifrac",weighted=FALSE)
## Warning in matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, : data
## length [2023] is not a sub-multiple or multiple of the number of rows [1012]
          # #Make Ordination Plot and Store it  
            ord_unifrace_uw1<-plot_ordination(ps_rare, ord_nmds_unifrac_unweighted, color="transect.name", title="Unweighted Unifrac NMDS")
            ord_unifrace_uw1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()

3.5.2 Weighted Unifrac

          ## Weighted Unifrac  
            ### Ordination using built in functions in phyoseq (calls vegan)    
            ord_nmds_unifrac_weighted <- ordinate(ps_rare, method="PCoA", distance="unifrac",weighted=TRUE)
## Warning in matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, : data
## length [2023] is not a sub-multiple or multiple of the number of rows [1012]
          #Make Ordination Plot and Store it  
            ord_unifrace_w1<-plot_ordination(ps_rare, ord_nmds_unifrac_weighted, color="transect.name", title="Weighted Unifrac NMDS")
            ord_unifrace_w1 + stat_ellipse(geom="polygon",aes(fill=transect.name),type="norm",alpha=0.4) + theme_bw()


3.5.3 Alternative UNIFRAC calculations with rbiom


      library(rbiom)
## 
## Attaching package: 'rbiom'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:DESeq2':
## 
##     counts
## The following object is masked from 'package:S4Vectors':
## 
##     metadata
## The following object is masked from 'package:BiocGenerics':
## 
##     counts
## The following object is masked from 'package:vegan':
## 
##     rarefy
## The following objects are masked from 'package:phyloseq':
## 
##     nsamples, ntaxa, sample.names
  #Strip Out the ASV matrix and Tree from Our Phyloseq Object
 counts = otu_table(ps_rare)
      tree = phy_tree(ps_rare)
      
  #We can calculate weighted and unqieghted UNIFRAC like so:    
      rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
      rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)
      
  ## And then perform calculations like PERMANOVA and Ordinations on them in VEGAN    
      soil.s<-as(sample_data(ps_rare),"data.frame") # strip out the sample data 
      
  ##PERMANOVA - Are Sites DIfferent? 
       permanova_site = adonis(
        formula=rbiom_weighted ~ site.name, 
        data=soil.s,
        permutations=999,
      )
       permanova_site #58 of variance explained!
## 
## Call:
## adonis(formula = rbiom_weighted ~ site.name, data = soil.s, permutations = 999) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs  MeanSqs F.Model      R2 Pr(>F)    
## site.name 16    1.2887 0.080545  2.9497 0.59594  0.001 ***
## Residuals 32    0.8738 0.027306         0.40406           
## Total     48    2.1625                  1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## Weighted UNIFRAC ordination 
       soil_mds_weighted = metaMDS(
       as.dist(rbiom_weighted)
      )
## Run 0 stress 0.1374004 
## Run 1 stress 0.1592184 
## Run 2 stress 0.1654294 
## Run 3 stress 0.1497536 
## Run 4 stress 0.1502705 
## Run 5 stress 0.1546879 
## Run 6 stress 0.1553554 
## Run 7 stress 0.1573829 
## Run 8 stress 0.1397793 
## Run 9 stress 0.1542538 
## Run 10 stress 0.137744 
## ... Procrustes: rmse 0.01497263  max resid 0.09721065 
## Run 11 stress 0.157242 
## Run 12 stress 0.137363 
## ... New best solution
## ... Procrustes: rmse 0.002410947  max resid 0.01211236 
## Run 13 stress 0.1689045 
## Run 14 stress 0.1614346 
## Run 15 stress 0.150911 
## Run 16 stress 0.4021143 
## Run 17 stress 0.1378266 
## ... Procrustes: rmse 0.01538004  max resid 0.09684462 
## Run 18 stress 0.1498033 
## Run 19 stress 0.1462731 
## Run 20 stress 0.1497546 
## *** No convergence -- monoMDS stopping criteria:
##     19: stress ratio > sratmax
##      1: scale factor of the gradient < sfgrmin
  #Plot with Vegan 
      plot(soil_mds_weighted,display="sites",type="n",las=1,ylab="",xaxt="n",yaxt="n",xlab="")
          #Add On A Spider linking each point to its population centroid  
            ordispider(soil_mds_weighted,soil.s$transect.name,col="blue",label=T)
         #Add the data points  
          with(soil.s,points(soil_mds_weighted,display="sites",pch=21,cex=2,bg="white"))

 ###### More COmplex Model 
       permanova_site_transect = adonis(
        formula=rbiom_weighted ~ transect.name +  vegetation + site.name, 
        data=soil.s,
        permutations=999,
      )
       permanova_site_transect #58 of variance explained!
## 
## Call:
## adonis(formula = rbiom_weighted ~ transect.name + vegetation +      site.name, data = soil.s, permutations = 999) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##               Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## transect.name  1   0.12840 0.12840  4.7023 0.05938  0.001 ***
## vegetation     1   0.35151 0.35151 12.8730 0.16255  0.001 ***
## site.name     14   0.80881 0.05777  2.1157 0.37401  0.001 ***
## Residuals     32   0.87380 0.02731         0.40406           
## Total         48   2.16252                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
       ###### More COmplex Model 
       permanova_site_transect_noveg = adonis(
        formula=rbiom_weighted ~  site.name + transect.name , 
        data=soil.s,
        permutations=999,
      )

       permanova_site_transect_noveg
## 
## Call:
## adonis(formula = rbiom_weighted ~ site.name + transect.name,      data = soil.s, permutations = 999) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs  MeanSqs F.Model      R2 Pr(>F)    
## site.name 16    1.2887 0.080545  2.9497 0.59594  0.001 ***
## Residuals 32    0.8738 0.027306         0.40406           
## Total     48    2.1625                  1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


3.6 CCA Models

When we fit NMDS models to microbiome data, we’re making no assumptions about how the data correspond to sample groups of interest. Rather, we ordinate samples in n-dimensional space, corresponding to the number of axes we choose to ordinate to, and then overlay our sample metadata on top to look for patterns.

CCA, or Constrained Correspondence Analysis, does something different. We can fit a model to the ASV / distance matrix and attempt to constrain variation in our data into that explained by our variables of interest. The second step is to ordinate the unconstrained residual information in the data. The advantage is that we get some statistics about our variables of interest alongside the ordination. Let’s repeat some analysis above using CCA

soil_cca<- cca(soil.v ~ transect.name + site.name,data=soil.s)
soil_cca
## Call: cca(formula = soil.v ~ transect.name + site.name, data = soil.s)
## 
##               Inertia Proportion Rank
## Total         26.0296     1.0000     
## Constrained   10.1162     0.3886   16
## Unconstrained 15.9134     0.6114   32
## Inertia is scaled Chi-square 
## Some constraints were aliased because they were collinear (redundant)
## 
## Eigenvalues for constrained axes:
##   CCA1   CCA2   CCA3   CCA4   CCA5   CCA6   CCA7   CCA8   CCA9  CCA10  CCA11 
## 0.9549 0.9008 0.8345 0.7722 0.7169 0.6921 0.6588 0.6333 0.6325 0.6007 0.5319 
##  CCA12  CCA13  CCA14  CCA15  CCA16 
## 0.5228 0.5091 0.4671 0.3594 0.3293 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.8691 0.8466 0.8218 0.7824 0.6930 0.6532 0.6174 0.6063 
## (Showing 8 of 32 unconstrained eigenvalues)

We use the ‘Anova’ function in the ‘car’ package, which changes something called the ‘sum of squares’ to makes the results insensitive to the order they’re added. This is super useful, and makes the results more reproducible.

car::Anova(soil_cca)
## Analysis of Deviance Table (Type II tests)
## 
## Response: soil.v
##               Df    Chisq Pr(>Chisq)    
## transect.name  1    74.01  < 2.2e-16 ***
## site.name     15 40414.14  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


3.6.1 TASK

Fit a model to test for differences among vegetation types using CCA.

3.6.2 CCA with Continuous Variables

These models also work with continuous predictors. Here we look at the effect of pH. First we fit the model (replacing some missing values with the mean of pH first):


 soil.s$ph[which(is.na(soil.s$ph))]<-mean(soil.s$ph,na.rm=T)
    soil_env<- cca(soil.v ~ elevation+ ph, data=soil.s)


Then we can inspect the results, as above:

soil_env
## Call: cca(formula = soil.v ~ elevation + ph, data = soil.s)
## 
##                Inertia Proportion Rank
## Total         26.02958    1.00000     
## Constrained    1.56568    0.06015    2
## Unconstrained 24.46390    0.93985   46
## Inertia is scaled Chi-square 
## 
## Eigenvalues for constrained axes:
##   CCA1   CCA2 
## 0.8700 0.6957 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.9634 0.8815 0.8576 0.8359 0.8003 0.7890 0.7767 0.7685 
## (Showing 8 of 46 unconstrained eigenvalues)


And test for the effects in the model:

car::Anova(soil_env)
## Analysis of Deviance Table (Type II tests)
## 
## Response: soil.v
##           Df   Chisq Pr(>Chisq)    
## elevation  1 950.152  < 2.2e-16 ***
## ph         1  17.321  3.157e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(soil_cca)

3.6.3 TASK

> Fit a CCA model investigating the effect of Average Soil Relative Humidity on microbial community structure. Compare the results and output here to that from PERMANOVA
  

3.7 Within-Group Divergence

The ‘microbiome’ package has some useful functions for investigating beta diversity traits of samples. There’s a nice guide here. Here we’ll calculate ‘divergence’ - a measure of how much variation there is in microbiome structure within, versus among groups of interest. Some groups might all have very similar microbiome profiles (low divergence), and others very distinct profiles. Differences in divergence among groups might tell us if there are population specific processes shaping microbiome structure.
Here I refer to groups as (statistical) populations, but they could be anything. Here we’ll look at microbiome divergence within sites


 ##################### CALCULATE WITHIN POPULATION DIVERGENCE
      
        #Calculate Within-Pop Divergence
        library(microbiome)
        
      #Extract Population Info   
        popvals<-as.character(unique(sample_data(ps_rare)$transect.name))
        popvals
## [1] "Baquedano" "Yungay"
      ## Make an Empty List to Store Divergence Values    
      div.list<-rep(list(NA),length(popvals))
        for(k in 1:length(popvals)){
          
        #Subset to Reference Group 
          div.temp<-subset_samples(ps_rare,transect.name==popvals[k])
          
        #Calculate Median for Each ASV 
          div.reference <- apply(abundances(div.temp), 1, mean) # median often used instead but this dataset is funny
          
          div.list[[k]]<-divergence(div.temp,div.reference,method="bray")
        }
        
      #Expand Divergence Values from List to Data Frame  
        divergence_values<-data.frame(divergence=unlist(div.list),Sample=names(unlist(div.list)))
        
    #Add On Divergence Values    
        soil.s$Sample<-rownames(soil.s)
        soil_divergence<-left_join(soil.s,divergence_values,"Sample")
        
    #Plot Divergence By Pop 
        ggplot(soil_divergence,aes(x=transect.name,y=divergence)) + geom_violin(aes(fill=transect.name))



3.8 Network Analysis


Network analysis allows us to visualise co-occurrence of microbial taxa, or similarity of samples based on microbial community profiles. First, let’s make a network of microbial taxa. Normally, we’d only want to present a subset of our microbial taxa so the network doesn’t become too unwieldy - but we’ve already subsetted the data to only taxa with >100 reads for our analysis

ig <- make_network(ps_rare, "taxa", distance = "jaccard", max.dist = 0.95)
plot_network(ig, ps_rare, type="taxa", point_size = 5, label=NULL, color="Class", line_alpha = 0.05)

Next we’ll do the same, but with our samples. The advantage here is that we can overlay important sample metadata over the top of sample points to see if these data predict network similarity / structure. We’ll do that here with the Vegetation variable.

ig <- make_network(ps_rare, "samples", distance = "jaccard", max.dist = 0.95)
plot_network(ig, ps_rare, type="samples", point_size = 5, label=NULL, color="vegetation", line_alpha = 0.05)

data(enterotype)
ig <- make_network(enterotype, max.dist=0.3,distance="bray")
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)


3.8.1 TASK

Adapt the code above to vary the ‘max.dist’ threshold and see how the network changes at different levels of stringency

SECTION 4: ACCOUNTING FOR COMPOSITIONAL NATURE OF MICROBIOME DATA

4.1 Centred-Log Ratio Transformation of Data

############### BETA DIVERISTY PLOT AND TESTS BASED ON CLR-TRANSFORMED DATA OF UNRAREFIED LIBRARY SIZES (XAH)
        
        
        #### Vegan Function to Convert Phyloseq into Something Vegan can call directly
        vegan_otu <- function(physeq) {
          OTU <- otu_table(physeq)
          if (taxa_are_rows(OTU)) {
            OTU <- t(OTU)
          }
          return(as(OTU, "matrix"))
        }
        
        ########### Transform Sample Counts
          library(microbiome)
        
              #CLR Transform  on unrarefied object
              ps_clr <- microbiome::transform(physeq100, "clr")
              head(otu_table(ps_clr))
## OTU Table:          [6 taxa and 54 samples]
##                      taxa are rows
##                                  BAQ2420.1.1 BAQ2420.1.2 BAQ2420.1.3  BAQ2420.2
## 9e5c513fb56d18593fe6a31808b722d8  -0.2212132  -0.3628818  -0.2650607 -0.3584369
## fce1dd8b549c362b7abf3d12a9baba91  -0.2212132  -0.3628818  -0.2650607 -0.3584369
## e7e19b672a061e119437a7cf815db964   3.1676181  -0.3628818  -0.2650607 -0.3584369
## 5eca5c84925c78b71899493b972dcc65  -0.2212132  -0.3628818  -0.2650607 -0.3584369
## c3bd77a0f6486c74b4b713ce8a5f285b  -0.2212132  -0.3628818  -0.2650607 -0.3584369
## bdb152823a97308c1f049837f0cf13db  -0.2212132  -0.3628818  -0.2650607 -0.3584369
##                                   BAQ2420.3  BAQ2462.1  BAQ2462.2  BAQ2462.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## fce1dd8b549c362b7abf3d12a9baba91 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## e7e19b672a061e119437a7cf815db964 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## 5eca5c84925c78b71899493b972dcc65 -0.3375948 -0.3178584 -0.3252743 -0.2947759
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3375948 -0.3178584 -0.3252743 -0.2947759
## bdb152823a97308c1f049837f0cf13db -0.3375948 -0.3178584 -0.3252743 -0.2947759
##                                   BAQ2687.1  BAQ2687.2  BAQ2687.3  BAQ2838.1
## 9e5c513fb56d18593fe6a31808b722d8 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## fce1dd8b549c362b7abf3d12a9baba91 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## e7e19b672a061e119437a7cf815db964 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## 5eca5c84925c78b71899493b972dcc65 -0.2658175 -0.2173518 -0.1977687 -0.2473847
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2658175 -0.2173518 -0.1977687 -0.2473847
## bdb152823a97308c1f049837f0cf13db -0.2658175 -0.2173518 -0.1977687 -0.2473847
##                                   BAQ2838.2  BAQ2838.3  BAQ3473.1  BAQ3473.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## fce1dd8b549c362b7abf3d12a9baba91 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## e7e19b672a061e119437a7cf815db964 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## 5eca5c84925c78b71899493b972dcc65 -0.2221959 -0.1689595 -0.3084103 -0.3639087
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2221959 -0.1689595 -0.3084103 -0.3639087
## bdb152823a97308c1f049837f0cf13db -0.2221959 -0.1689595 -0.3084103 -0.3639087
##                                   BAQ3473.3 BAQ4166.1.1 BAQ4166.1.2 BAQ4166.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.1665336  -0.2747831  -0.3323028  -0.2839084
## fce1dd8b549c362b7abf3d12a9baba91 -0.1665336  -0.2747831  -0.3323028  -0.2839084
## e7e19b672a061e119437a7cf815db964 -0.1665336  -0.2747831  -0.3323028  -0.2839084
## 5eca5c84925c78b71899493b972dcc65 -0.1665336  -0.2747831   3.4667742  -0.2839084
## c3bd77a0f6486c74b4b713ce8a5f285b  4.8160953  -0.2747831  -0.3323028  -0.2839084
## bdb152823a97308c1f049837f0cf13db -0.1665336  -0.2747831  -0.3323028  -0.2839084
##                                   BAQ4166.2  BAQ4166.3 BAQ4697.1 BAQ4697.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.3082617 -0.3856309 -0.374190 -0.307527
## fce1dd8b549c362b7abf3d12a9baba91 -0.3082617 -0.3856309 -0.374190 -0.307527
## e7e19b672a061e119437a7cf815db964 -0.3082617 -0.3856309 -0.374190 -0.307527
## 5eca5c84925c78b71899493b972dcc65 -0.3082617 -0.3856309  2.050374 -0.307527
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3082617 -0.3856309 -0.374190 -0.307527
## bdb152823a97308c1f049837f0cf13db -0.3082617 -0.3856309 -0.374190 -0.307527
##                                   BAQ4697.3 YUN1005.1.1  YUN1005.3  YUN1242.1
## 9e5c513fb56d18593fe6a31808b722d8 -0.3488606  -0.1893163 -0.2056995 -0.1805398
## fce1dd8b549c362b7abf3d12a9baba91 -0.3488606  -0.1893163 -0.2056995 -0.1805398
## e7e19b672a061e119437a7cf815db964 -0.3488606  -0.1893163 -0.2056995 -0.1805398
## 5eca5c84925c78b71899493b972dcc65 -0.3488606  -0.1893163 -0.2056995 -0.1805398
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3488606  -0.1893163 -0.2056995 -0.1805398
## bdb152823a97308c1f049837f0cf13db -0.3488606  -0.1893163 -0.2056995 -0.1805398
##                                   YUN1242.3  YUN1609.1  YUN2029.2  YUN3153.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## fce1dd8b549c362b7abf3d12a9baba91 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## e7e19b672a061e119437a7cf815db964 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## 5eca5c84925c78b71899493b972dcc65 -0.1865694 -0.1983486 -0.2321907 -0.2574349
## c3bd77a0f6486c74b4b713ce8a5f285b -0.1865694 -0.1983486 -0.2321907 -0.2574349
## bdb152823a97308c1f049837f0cf13db -0.1865694 -0.1983486 -0.2321907 -0.2574349
##                                   YUN3153.3 YUN3259.1.1 YUN3259.1.2 YUN3259.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.2610856  -0.0835811   -0.340611  -0.1745068
## fce1dd8b549c362b7abf3d12a9baba91 -0.2610856  -0.0835811   -0.340611  -0.1745068
## e7e19b672a061e119437a7cf815db964 -0.2610856  -0.0835811   -0.340611  -0.1745068
## 5eca5c84925c78b71899493b972dcc65 -0.2610856  -0.0835811   -0.340611  -0.1745068
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2610856  -0.0835811   -0.340611  -0.1745068
## bdb152823a97308c1f049837f0cf13db -0.2610856  -0.0835811   -0.340611  -0.1745068
##                                   YUN3259.2  YUN3259.3  YUN3346.1  YUN3346.2
## 9e5c513fb56d18593fe6a31808b722d8 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## fce1dd8b549c362b7abf3d12a9baba91 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## e7e19b672a061e119437a7cf815db964 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## 5eca5c84925c78b71899493b972dcc65 -0.2920226 -0.3296262 -0.3290151 -0.2128194
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2920226 -0.3296262 -0.3290151 -0.2128194
## bdb152823a97308c1f049837f0cf13db -0.2920226 -0.3296262 -0.3290151 -0.2128194
##                                   YUN3346.3  YUN3428.1 YUN3428.2  YUN3428.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.2893778 -0.3039452  2.474385  3.4726658
## fce1dd8b549c362b7abf3d12a9baba91 -0.2893778 -0.3039452 -0.432200 -0.3082498
## e7e19b672a061e119437a7cf815db964 -0.2893778  3.6273357 -0.432200 -0.3082498
## 5eca5c84925c78b71899493b972dcc65 -0.2893778 -0.3039452 -0.432200 -0.3082498
## c3bd77a0f6486c74b4b713ce8a5f285b -0.2893778 -0.3039452 -0.432200 -0.3082498
## bdb152823a97308c1f049837f0cf13db -0.2893778  2.3195337  2.775787 -0.3082498
##                                  YUN3533.1.1 YUN3533.1.2 YUN3533.1.3  YUN3533.2
## 9e5c513fb56d18593fe6a31808b722d8  -0.1820351  -0.3002894  -0.3051522 -0.3547928
## fce1dd8b549c362b7abf3d12a9baba91  -0.1820351  -0.3002894  -0.3051522  1.8668232
## e7e19b672a061e119437a7cf815db964  -0.1820351  -0.3002894  -0.3051522  2.3967425
## 5eca5c84925c78b71899493b972dcc65  -0.1820351  -0.3002894  -0.3051522 -0.3547928
## c3bd77a0f6486c74b4b713ce8a5f285b  -0.1820351  -0.3002894  -0.3051522 -0.3547928
## bdb152823a97308c1f049837f0cf13db  -0.1820351  -0.3002894  -0.3051522  2.2101566
##                                   YUN3533.3 YUN3856.1.1 YUN3856.1.2 YUN3856.1.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3293196  -0.1991167  -0.3602824  -0.3349785
## fce1dd8b549c362b7abf3d12a9baba91 -0.3293196  -0.1991167   2.6967911  -0.3349785
## e7e19b672a061e119437a7cf815db964 -0.3293196  -0.1991167  -0.3602824  -0.3349785
## 5eca5c84925c78b71899493b972dcc65 -0.3293196  -0.1991167  -0.3602824  -0.3349785
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3293196  -0.1991167  -0.3602824  -0.3349785
## bdb152823a97308c1f049837f0cf13db -0.3293196  -0.1991167  -0.3602824  -0.3349785
##                                   YUN3856.2  YUN3856.3
## 9e5c513fb56d18593fe6a31808b722d8 -0.3453369 -0.4737605
## fce1dd8b549c362b7abf3d12a9baba91 -0.3453369  2.3833124
## e7e19b672a061e119437a7cf815db964 -0.3453369 -0.4737605
## 5eca5c84925c78b71899493b972dcc65 -0.3453369 -0.4737605
## c3bd77a0f6486c74b4b713ce8a5f285b -0.3453369 -0.4737605
## bdb152823a97308c1f049837f0cf13db -0.3453369 -0.4737605

4.2 Ordination of CLR-transformed Data Using PCA

      #Extract Matrix and Sample Data  
        ps_clr_v<-vegan_otu(ps_clr)
        ps_clr_s<-as(sample_data(ps_clr),"data.frame")
        
        ######## Principal Components Analysis
          ps_pc<-prcomp(ps_clr_v)
          
              #Cool Biplot Showing How Diff ASVs affect the primary axes of the ordinatiton
                  biplot(ps_pc)

              #Scree plot of relative importance of explained by each axis
                  plot(ps_pc)

              #Variance Explained by FIrst 2 axes  
                  summary(ps_pc)$importance[,1:2] # 
##                             PC1      PC2
## Standard deviation     3.579576 2.817961
## Proportion of Variance 0.119370 0.073970
## Cumulative Proportion  0.119370 0.193340
          #### Extract Scores for Plotting        
              library(vegan)
              ps_pc_scores<-scores(ps_pc)
              ps_pc_scores_sub<-ps_pc_scores[,1:2]
            #Add Sample Data  
              ps_pc_scores_sub<-cbind(ps_pc_scores_sub,ps_clr_s)

4.3 Plots of PCA Ordinations of CLR-transformed Data

        ######### Plot
            library(ggplot2)
            library(cowplot)
            
          #Housekeeping + Label Setup    
           # ps_pc_scores_sub$Species<-as.factor(ps_pc_scores_sub$Full_names)
            
          
         #Plot    
            pca1<-ggplot(ps_pc_scores_sub,aes(x=PC1,y=PC2)) + stat_ellipse(type="t",aes(color=transect.name),level = 0.95,alpha=0.5) + geom_point(aes(colour=transect.name),size=5) 
            pca2<-pca1 + theme_bw() + labs(fill="Transect Name",x="PC1 (11.7%)",y="PC2 (7.3%)") 
            pca3<-pca2 +  theme(legend.position="top",axis.text=element_text(size=18),axis.title=element_text(size=20),legend.text = element_text(size=12),legend.title = element_text(size=18))
               pca3

4.4 PERMANOVA on CLR-Transformed Data

Here we’ll re-do the CLR transform above, but makign sure that we clean the data adequately to include only complete cases of some variables, and make sure there are no taxa with zero counts. There _shouldn’t be any, because we’re using our stringently-subsetted physeq100 dataset, but it’s good data practice.

############### PERMANOVA ON CLR-Transformed DISTANCES
            
        
    ### Need Complete Cases for Metadata
        
        #Subset ECU to Complete Cases for Breedign Mode (as eg)
          physeq_complete<-prune_samples(!is.na(sample_data(physeq100)$transect.name),physeq100)
          
        #Can't Have Zero-Sum taxa for distances 
          physeq_complete_nonzero<-prune_taxa(taxa_sums(physeq_complete)>0,physeq_complete)
          
        #CLR Transform  on unrarefied object
          physeq_complete_clr <- microbiome::transform(physeq_complete_nonzero, "clr")

        #Extract Matrix and Sample Data  
          physeq_complete_clr_v<-vegan_otu(physeq_complete_clr)
          physeq_complete_clr_s<-as(sample_data(physeq_complete_clr),"data.frame")
    
  ## Perform PERMANOVA        
    ps_clr_perm <- adonis2(physeq_complete_clr_v ~ site.name,data=physeq_complete_clr_s, permutations=999,method="euclidean")
    ps_clr_perm
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = physeq_complete_clr_v ~ site.name, data = physeq_complete_clr_s, permutations = 999, method = "euclidean")
##           Df SumOfSqs      R2      F Pr(>F)    
## site.name 16   2885.6 0.50719 2.3799  0.001 ***
## Residual  37   2803.8 0.49281                  
## Total     53   5689.3 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.4.1 TASK

Compare the PERMANOVA model above of CLR-transformed data with a ‘standard’ model from the same ‘physeq100’ dataset fitted to rarefied data.

SECTION 5: ADVANCED STATISTICAL MODELS

Sometimes we might want to calculate derived quantities of microbiome community composition, like alpha diversity, and use them in more classical statistical models like GLMs or GLMMs. One reason for this is because the data might be inherently structured. In the case of the soil dataset, sampling took place at multiple sites, which themselves were nested across a wider spatial scale of 2 transects, so we might want to account for that in our models.

Here we’ll run some models to look at predictors of alpha diversity in our dataset. We’ll use the code we used to generate use our soil richness data from above, copied below for reference.

      soil_richness<-estimate_richness(ps_rare,measures=c("Observed","Shannon","InvSimpson"))
         #Add Richness Onto Our Metadata
         soil_richness$Sample<-rownames(soil_richness)
         soil_richness<-left_join(soil_richness,soil_meta,"Sample")

5.1 Fitting a Mixed Effects Model

Now we have everything in one dataframe for fitting models to. Here we’ll fit a model where we try and explain the factors predicting alpha diversity. We will use:

  • Shannon diversity as the response variable, with a Gaussian error structure

  • Vegatation interacting with Transect as FIXED effects

  • Site ID as a RANDOM effect - to account for the fact that there is non-independence among samples taken from the same site

    Note that if any of these variables have missing data (NA values), then R will drop them from the model. But for model comparison we need to make sure all models are fitted to the same number of data

  library(lme4)

  #Complete Dataset 
    soil_richness_complete<-soil_richness[complete.cases(soil_richness[,c("vegetation","average.soil.relative.humidity","transect.name","site.name")]),]

  #Fit Model 
    m1<-lmer(Shannon ~   average.soil.relative.humidity + vegetation + transect.name + (1|site.name),data=soil_richness_complete)
      summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +  
##     (1 | site.name)
##    Data: soil_richness_complete
## 
## REML criterion at convergence: 34
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.60366 -0.52589  0.03536  0.55380  2.40288 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  site.name (Intercept) 0.002139 0.04625 
##  Residual              0.084071 0.28995 
## Number of obs: 46, groups:  site.name, 16
## 
## Fixed effects:
##                                 Estimate Std. Error t value
## (Intercept)                     2.460270   0.177941  13.826
## average.soil.relative.humidity  0.008338   0.002691   3.098
## vegetationyes                   0.041082   0.146510   0.280
## transect.nameYungay            -0.231647   0.095288  -2.431
## 
## Correlation of Fixed Effects:
##             (Intr) avr... vgttny
## avrg.sl.rl. -0.878              
## vegetatinys  0.496 -0.773       
## trnsct.nmYn -0.498  0.299 -0.312
      m1_glm<-glm(Shannon ~   average.soil.relative.humidity,data=soil_richness_complete)
    summary(m1_glm)
## 
## Call:
## glm(formula = Shannon ~ average.soil.relative.humidity, data = soil_richness_complete)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.87984  -0.14419   0.01262   0.20905   0.61334  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.280601   0.145389  15.686  < 2e-16 ***
## average.soil.relative.humidity 0.009275   0.001750   5.301 3.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09458495)
## 
##     Null deviance: 6.8192  on 45  degrees of freedom
## Residual deviance: 4.1617  on 44  degrees of freedom
## AIC: 26.018
## 
## Number of Fisher Scoring iterations: 2

We can inspect our model using ‘summary’

summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +  
##     (1 | site.name)
##    Data: soil_richness_complete
## 
## REML criterion at convergence: 34
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.60366 -0.52589  0.03536  0.55380  2.40288 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  site.name (Intercept) 0.002139 0.04625 
##  Residual              0.084071 0.28995 
## Number of obs: 46, groups:  site.name, 16
## 
## Fixed effects:
##                                 Estimate Std. Error t value
## (Intercept)                     2.460270   0.177941  13.826
## average.soil.relative.humidity  0.008338   0.002691   3.098
## vegetationyes                   0.041082   0.146510   0.280
## transect.nameYungay            -0.231647   0.095288  -2.431
## 
## Correlation of Fixed Effects:
##             (Intr) avr... vgttny
## avrg.sl.rl. -0.878              
## vegetatinys  0.496 -0.773       
## trnsct.nmYn -0.498  0.299 -0.312

5.2 Model Selection Using AIC

First, we convert our model from REML (REstricted Maximum Likelihood) to ML (Maximum Likelihood) to allow us to compare models with different fixed effects structure. Then we’ll rank our model by AICc to compare model fit whilst also penalising models for their complexity.

Model selection philosophy is a complex issue. We’re using Information Theory here (AIC and related information critera), but other options are available, such as frequentist based p values, or Bayesian models and associated model selection. This paper here should proivde a solid foundation to the various approaches available to using mixed effects models in R.

  library(MuMIn)

  #Convert model to ML
    m1_ml<-update(m1,REML=F)
## boundary (singular) fit: see ?isSingular
    summary(m1_ml)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: 
## Shannon ~ average.soil.relative.humidity + vegetation + transect.name +  
##     (1 | site.name)
##    Data: soil_richness_complete
## 
##      AIC      BIC   logLik deviance df.resid 
##     25.4     36.3     -6.7     13.4       40 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.76877 -0.56083  0.02674  0.58486  2.56802 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  site.name (Intercept) 0.00000  0.0000  
##  Residual              0.07828  0.2798  
## Number of obs: 46, groups:  site.name, 16
## 
## Fixed effects:
##                                 Estimate Std. Error t value
## (Intercept)                     2.461706   0.166379  14.796
## average.soil.relative.humidity  0.008288   0.002501   3.314
## vegetationyes                   0.043305   0.135200   0.320
## transect.nameYungay            -0.230945   0.088105  -2.621
## 
## Correlation of Fixed Effects:
##             (Intr) avr... vgttny
## avrg.sl.rl. -0.882              
## vegetatinys  0.496 -0.769       
## trnsct.nmYn -0.489  0.297 -0.318
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
  #Set Global Options to Fail in the Case of NAs - done to rpevent models fitted to different amounts of data 
    options(na.action = "na.fail")
    
  ###### P Value 
    m1_null<-update(m1_ml,~.-average.soil.relative.humidity)
    summary(m1_null)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Shannon ~ vegetation + transect.name + (1 | site.name)
##    Data: soil_richness_complete
## 
##      AIC      BIC   logLik deviance df.resid 
##     32.9     42.0    -11.4     22.9       41 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2023 -0.6752 -0.0686  0.4926  2.4172 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  site.name (Intercept) 0.008903 0.09435 
##  Residual              0.088297 0.29715 
## Number of obs: 46, groups:  site.name, 16
## 
## Fixed effects:
##                     Estimate Std. Error t value
## (Intercept)          2.93765    0.09697  30.295
## vegetationyes        0.40015    0.10478   3.819
## transect.nameYungay -0.32517    0.10346  -3.143
## 
## Correlation of Fixed Effects:
##             (Intr) vgttny
## vegetatinys -0.592       
## trnsct.nmYn -0.544 -0.108
    anova(m1_null,m1_ml)
## Data: soil_richness_complete
## Models:
## m1_null: Shannon ~ vegetation + transect.name + (1 | site.name)
## m1_ml: Shannon ~ average.soil.relative.humidity + vegetation + transect.name + (1 | site.name)
##         npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## m1_null    5 32.875 42.018 -11.438   22.875                        
## m1_ml      6 25.360 36.332  -6.680   13.360 9.5153  1   0.002038 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  #Rank by AICc 
    library(MuMIn)
    m1_aic_rank<-dredge(m1_ml)
## Fixed term is "(Intercept)"
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular

Deciding Which Models to Use We can insepct our ranked models as follows

m1_aic_rank
## Global model call: lmer(formula = Shannon ~ average.soil.relative.humidity + vegetation + 
##     transect.name + (1 | site.name), data = soil_richness_complete, 
##     REML = F)
## ---
## Model selection table 
##   (Int) avr.sol.rlt.hmd trn.nam vgt df  logLik AICc delta weight
## 4 2.435        0.008904       +      5  -6.731 25.0  0.00  0.674
## 8 2.462        0.008288       +   +  6  -6.680 27.5  2.55  0.188
## 2 2.272        0.009387              4  -9.903 28.8  3.82  0.100
## 6 2.240        0.010370           +  5  -9.787 31.1  6.11  0.032
## 7 2.938                       +   +  5 -11.438 34.4  9.41  0.006
## 5 2.737                           +  4 -15.321 39.6 14.65  0.000
## 3 3.149                       +      4 -16.852 42.7 17.72  0.000
## 1 2.947                              3 -18.944 44.5 19.50  0.000
## Models ranked by AICc(x) 
## Random terms (all models): 
## '1 | site.name'

We can see that the ‘top’ model, with the lowest AICc score, contains only the MAIN effects of Vegetation and Transect ID. There is onyl 1 other model within 6 AIC units of the top model, which is widely regarded as the cutoff one should use.

However, this model is a more complex versions of the top model - they contain more parameters but have a poorer ‘fit to the data ’fit-complexity’ tradeoff. Under the nesting rule, we can ignore these models and not use them for inference, leaving us with just the top model. This makes life much simpler. Note that if we had used p values for simplification, we might have arrived at the same conclusion - where only Relative Humidity is retained for inference.

5.3 Calculating Explained Variance for Models

The R package MuMIn has some handy functions for calculating an analogue of r2 for mixed models

#Re-Fit Top Model 
    m_top<-lmer(Shannon ~  average.soil.relative.humidity  + (1|site.name),data=soil_richness_complete)
    summary(m_top)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Shannon ~ average.soil.relative.humidity + (1 | site.name)
##    Data: soil_richness_complete
## 
## REML criterion at convergence: 34.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.66108 -0.45066  0.02597  0.62491  1.91465 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  site.name (Intercept) 0.009864 0.09932 
##  Residual              0.085594 0.29256 
## Number of obs: 46, groups:  site.name, 16
## 
## Fixed effects:
##                                Estimate Std. Error t value
## (Intercept)                    2.267905   0.152811  14.841
## average.soil.relative.humidity 0.009447   0.001880   5.025
## 
## Correlation of Fixed Effects:
##             (Intr)
## avrg.sl.rl. -0.943
#Calculate r2
    r.squaredGLMM(m_top)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
##            R2m       R2c
## [1,] 0.3909167 0.4538547

Marginal r2 (R2m) is our % variance explained just due to the fixed effects. Conditional r2 (R2c) is variance explained by both fixed and random effects. As you can see here, the values are quite high. This is good!



SECTION 6: INDICATOR ANALYSIS AND DIFFERENTIAL ABUNDANCE TESTING



6.1 Indicator Analysis


Indicator analysis is based on this paper by Dufrene and Legendre in 1997.

The ideal indicator species: - is exclusively faithful to a particular group (EXCLUSIVITY) - occurs in all sample units within a group(FIDELITY)

Here can use the ‘vegan_otu’ function again to extract the ASV table from the phyloseq object, as it puts the ASvs as columns and samples as rows - this is the format we need for indicator analysis.

library(labdsv)

indicator_matrix<- vegan_otu(ps_rare)

Then we need a numeric vector of cluster identities that correspond to a treatment of interest (labdsv can’t just be given a factor). So let’s see if there are indicator species for each vegetation type. This code will convert the 2 level vegetation factor to numeric clusters where WITH vegetation = 1 and NO vegetation = 2.

 veg_numeric<-with(sample_data(ps_rare),ifelse(vegetation=="yes",1,2))

Then we run the indicator analysis by suppling the community matrix and the numeric vector of vegetation types

 #library(labdsv)
      veg_indic<-indval(indicator_matrix,veg_numeric)

#Inspect Results
      summary(veg_indic)
##                                   cluster indicator_value probability
## X6b780e361cfc5f06def718518324bdcb       1          0.6250       0.001
## ef3fdbe1dcde754d91130cde6a4b4d61        1          0.4192       0.017
## X409faa5f5353e543bf6d99125c7c0e83       1          0.3438       0.017
## X5c78314ff92e6fec9aa07acc1fa0dc24       1          0.3438       0.017
## ffd60d684f32e6fd5b47fe90095f9d34        1          0.3008       0.037
## X4c8ff0ea98d2c0ebb486e33bd96c61f9       1          0.2500       0.050
## e81a04a50f341415b6b8dbb1f94c559c        1          0.2500       0.040
## dc8a2f47b3d1dc2e1f5f805891976b29        2          0.6290       0.001
## a7b877ae6d2f079a15b6b192a4425620        2          0.5593       0.002
## a36b38f754f6abd278aeb9dbc7696343        2          0.4605       0.002
## X5eee6273221f15d4e0cdc2c033f0dd72       2          0.3654       0.007
## X6e4d034e4b967b745639f825dc73ebe6       2          0.3529       0.001
## e2b3c5ef478cae4cbabc7577bf139645        2          0.2941       0.003
## X8800bca0dea2ed79f023f3e709b19076       2          0.2353       0.022
## X7452a179b9a24c364642eeaade4d266f       2          0.2353       0.013
## X9279403ab31c9b2574b09cef10f08586       2          0.2353       0.012
## X45e4364ffb37868633b53e176d598b5e       2          0.2353       0.009
## X95f85b31b6d22bec1f34814fa2b4c100       2          0.2353       0.010
## ddbc32632c9632d1dd746f28721f3a9f        2          0.2166       0.027
## X47dc56fdba4dc5de3208a99337000601       2          0.1765       0.038
## X9b86760f1d24e78119093f5f08f45f71       2          0.1765       0.035
## ce6b49bab0851644749e892e745d50db        2          0.1765       0.035
## X43a82022093106139e5e81f88ce5a975       2          0.1765       0.036
## cd5442cc61343511b0de79526804825d        2          0.1765       0.039
## f31d6dfeec66b22ac98f0605d57e7bb4        2          0.1765       0.033
## X3761a26aef7e143cb2cdc05188589703       2          0.1765       0.041
## 
## Sum of probabilities                 =  715.219 
## 
## Sum of Indicator Values              =  58.46 
## 
## Sum of Significant Indicator Values  =  7.82 
## 
## Number of Significant Indicators     =  26 
## 
## Significant Indicator Distribution
## 
##  1  2 
##  7 19

This will give us a summary of significant indicator values (with p values), and the cluster to which they ‘belong’. They are arranged by descending indicator value withtin each cluster/vegetation type.

The function also helpfully calculates the relative abundance of individual ‘species’ in each of the two clusters

head(veg_indic$relabu)
##                                   1 2
## X99fa718995b3129c258bd12bdabb4bec 1 0
## X2dcce6012bdc7a9cb15900ee79beb2ed 1 0
## X8133d2b0b8fcb93ae67ebaaec9667e7b 1 0
## X996b8c93a5184a88d3ec0d3965fe31b2 1 0
## d183373f95d7758e5883ac24de921410  1 0
## X3a51360d9283ba62bef12229ba447c63 1 0

We can be stringent and subset our indicator values to only those above a certain threshold

    #Extract The Full Dataframe
     veg_indic_output<-data.frame(indval=veg_indic$indcls,pval=veg_indic$pval,cluster=veg_indic$maxcls)
     
    #Add in The Taxon Labels as a column 
     #R adds an annoying 'X' to the leading edge of hexadecimal IDs that start with numbers, so we need to strip those out using 'gsub'
     veg_indic_output$taxon<-gsub("^X","",rownames(veg_indic_output))
     
    #Subset to Indicator Values >0.5
      veg_indic_significant<-subset(veg_indic_output,indval>=0.5 & pval <=0.05)
        head(veg_indic_significant)
##                                      indval  pval cluster
## dc8a2f47b3d1dc2e1f5f805891976b29  0.6289763 0.001       2
## a7b877ae6d2f079a15b6b192a4425620  0.5592630 0.002       2
## X6b780e361cfc5f06def718518324bdcb 0.6250000 0.001       1
##                                                              taxon
## dc8a2f47b3d1dc2e1f5f805891976b29  dc8a2f47b3d1dc2e1f5f805891976b29
## a7b877ae6d2f079a15b6b192a4425620  a7b877ae6d2f079a15b6b192a4425620
## X6b780e361cfc5f06def718518324bdcb 6b780e361cfc5f06def718518324bdcb
        nrow(veg_indic_significant)
## [1] 3

This is cool, but let’s make it more useful by adding on the taxonomy

    #Annotate With Taxonomy 
        ps_rare_tax<-data.frame(tax_table(ps_rare))
        ps_rare_tax$taxon<-rownames(ps_rare_tax)
        veg_indic_significant<-left_join(veg_indic_significant,ps_rare_tax,"taxon")
        head(veg_indic_significant)
##      indval  pval cluster                            taxon     Kingdom
## 1 0.6289763 0.001       2 dc8a2f47b3d1dc2e1f5f805891976b29 d__Bacteria
## 2 0.5592630 0.002       2 a7b877ae6d2f079a15b6b192a4425620 d__Bacteria
## 3 0.6250000 0.001       1 6b780e361cfc5f06def718518324bdcb d__Bacteria
##             Phylum               Class             Order             Family
## 1   Proteobacteria Gammaproteobacteria   Burkholderiales   Burkholderiaceae
## 2 Actinobacteriota      Actinobacteria Nitriliruptorales Nitriliruptoraceae
## 3 Actinobacteriota      Actinobacteria     Micrococcales     Micrococcaceae
##                Genus                  Species
## 1          Ralstonia                     <NA>
## 2 Nitriliruptoraceae uncultured_Nitriliruptor
## 3               <NA>                     <NA>



6.2 Differential Abundance Testing Using DESeq2

The pitfalls of differential abundance testing

Critically, we perform these functions on our unrarefied data (physeq) because DESeq2 will fit a model that accounts for the different mean-variance relationship introduced by different library sizes.

Some Convenience Functions for Handling DESeq Data

    ########## Function To Handle Results Objects and Annotate with Taxonomy
    taxo<-function(resultsobject,physeqobject,alpha){
      sigtab<-resultsobject[which(resultsobject$padj<alpha),]
      sigtab<- cbind(as(sigtab, "data.frame"), as(tax_table(physeqobject)[rownames(sigtab), ], "matrix"))
      colnames(sigtab)[7:12]<-c("Kingdom","Phylum","Class","Order","Family","Genus")
      return(sigtab)
    }          
    
    ########## Function To Calculate Geometric Means
    gm_mean = function(x, na.rm=TRUE){
      exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
    }   
    
    ###### Function to parse significant data from DESeq2 results
    deseqplot_data<-function(sigtab){
      # Phylum order
      x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
      x = sort(x, TRUE)
      sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
      
      # Copy Across Genus Labels and Fill in Any Unassigned
      sigtab$Genus.long<-as.character(sigtab$Genus)
      sigtab$Genus.long[grep("unclassified",sigtab$Genus)] <-paste0("[",as.character(sigtab$Family[grep("unclassified",sigtab$Genus)]),"]")
      
      # Genus order
      x = tapply(sigtab$log2FoldChange, sigtab$Genus.long, function(x) max(x))
      x = sort(x, TRUE)
      sigtab$Genus.long = factor(as.character(sigtab$Genus.long), levels=names(x))
      return(sigtab)
    }


Fitting Models Using DESeq

Now we can fit models to our data using DESeq as follows. THE Atacama dataset is very data-poor, so stumbles on DESeq2 analyses. So we’ll use a built in dataset here called ‘GlobalPatterns’ which maps global microbial diversity across mutliple sample types. To keep things simple, we’ll only look at two sample types Soil and Ocean

data("GlobalPatterns")
table(sample_data(GlobalPatterns)$SampleType)
## 
##              Feces         Freshwater Freshwater (creek)               Mock 
##                  4                  2                  3                  3 
##              Ocean Sediment (estuary)               Skin               Soil 
##                  3                  3                  3                  3 
##             Tongue 
##                  2
gp_soilocean<-subset_samples(GlobalPatterns,SampleType %in% c("Soil","Ocean"))
gp_soilocean
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
sample_sums(gp_soilocean)
##     CL3     CC1     SV1     NP2     NP3     NP5 
##  864077 1135457  697509  523634 1478965 1652754

We use the ‘phyloseq_to_deseq2’ function in phyloseq to port the data in a fromat DESeq2 recognises, whilst also fitting the model we want to examine using the ‘depends on’ ~ symbol. Here we want to look for differential abundance of ASVs based on the 2-level factor of whether Sample Type (Soil or Ocean) (3 samples each)

  #Fit The Model We Are Interested in Using The function in phyloseq - gets the data in a format DESeq can read
    gpmod<-phyloseq_to_deseq2(gp_soilocean, ~ SampleType)
## converting counts to integer mode
library(DESeq2)

#   #Fit the DESeq model   
  gp.deseq = DESeq(gpmod, test="Wald", fitType="mean")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#Inspect The Results Names (comparisons we can make)
  resultsNames(gp.deseq)
## [1] "Intercept"                "SampleType_Soil_vs_Ocean"
  #Extract The Results and Specify the Contrasts We Want - Here 'No Vegetation' will be used as the reference level
    gp_results<-results(gp.deseq,contrast=c("SampleType","Soil","Ocean"))

  #Annotate The Taxa with the taxonomy
    gp_results_taxa<-taxo(gp_results,gp_soilocean,0.0001) #set threshold signfiicance 

#### Store The Data in a Format that We Can Plot With GGplot
    gp_plot_data<-deseqplot_data(gp_results_taxa)
    head(gp_plot_data)
##         baseMean log2FoldChange    lfcSE      stat       pvalue         padj
## 250392 12252.038      -13.45182 2.143351 -6.276071 3.472363e-10 2.719767e-07
## 12812  41888.526      -13.34543 2.087351 -6.393473 1.621597e-10 2.035915e-07
## 238123  3424.998      -12.66233 2.262418 -5.596816 2.183247e-08 3.461811e-06
## 535929  8721.573      -13.02494 2.299513 -5.664214 1.477000e-08 2.584240e-06
## 137837  2417.732      -13.22169 2.550415 -5.184131 2.170245e-07 1.940941e-05
## 332808 13807.033      -13.98571 2.361268 -5.922966 3.161868e-09 9.437245e-07
##         Kingdom         Phylum          Class         Order        Family Genus
## 250392  Archaea  Euryarchaeota Thermoplasmata            E2 MarinegroupII  <NA>
## 12812  Bacteria Actinobacteria Actinobacteria        koll13          <NA>  <NA>
## 238123 Bacteria  Cyanobacteria    Chloroplast   Cryptophyta          <NA>  <NA>
## 535929 Bacteria  Cyanobacteria    Chloroplast   Cryptophyta          <NA>  <NA>
## 137837 Bacteria  Cyanobacteria    Chloroplast Stramenopiles          <NA>  <NA>
## 332808 Bacteria  Cyanobacteria    Chloroplast Stramenopiles          <NA>  <NA>
##        Species Genus.long
## 250392    <NA>       <NA>
## 12812     <NA>       <NA>
## 238123    <NA>       <NA>
## 535929    <NA>       <NA>
## 137837    <NA>       <NA>
## 332808    <NA>       <NA>
#####Generate Plot
    gp_plot<-ggplot(gp_plot_data,aes(x=Genus.long, y=log2FoldChange)) + geom_point(shape=21,size=6,aes(fill=Phylum)) + theme_classic() +
      theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5,size=12)) +  scale_fill_brewer(palette="Set2")+geom_hline(yintercept = 0,linetype="dashed") + labs(x="Genus") + scale_shape_manual(values=c(21,22,23,24,25))
    gp_plot
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

6.3 Differential Abundance Testing With ALDEX2

Some people are suspicious of methods like DESeq2, because they don’t account for the compositional nature of the data that seqeuncers produce. There’s some evidence that the method is overly sensitive, producing too many false positives. Luckily we have other options - we can use a modelling framework like ALDEX2 that automatically accounts for the compositional nature of our data.

Here we’ll replicate the model above we ran for DESEq2. Note the ALDEX2 model uses the raw data and does the CLR transformation for us. So we just need to give it our subsetted Global Patterns object.

library(ALDEx2)

#Strip Out Matrix using functiion above 
    gp_soilocean_v<-t(vegan_otu(gp_soilocean))

#Strip Out a vector of covariates from the phyloseq object (needs to be two levels)
    gp_soilocean_sample<-sample_data(gp_soilocean)$SampleType

#Run ALDEX model
    gp_soilocean_aldexmodel<- aldex(gp_soilocean_v, gp_soilocean_sample, mc.samples=128, test="t", effect=TRUE,
                          include.sample.summary=TRUE, denom="all", verbose=FALSE)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables

## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables

## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables

## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables

## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables

## Warning in `[<-.data.frame`(`*tmp*`, , nm, value = structure(list(X1 =
## -2.44361877001038, : provided 12852 variables to replace 1 variables
    head(gp_soilocean_aldexmodel)
##           rab.all rab.win.Ocean rab.win.Soil rab.sample.CL3 rab.sample.CC1
## 549322 -1.4030191     0.1697115   -2.7437539      -2.443619      -3.371567
## 143239 -0.4110687    -0.5411641   -0.1762045      -2.443619      -3.371567
## 255340  0.5434051    -0.4389759    6.4850145      -2.443619      -3.371567
## 144887  0.2985363    -0.6543108    0.8125212      -2.443619      -3.371567
## 141782  0.5070370    -0.6792306    2.5683251      -2.443619      -3.371567
## 215972 -0.7550722    -0.7264933   -0.8145584      -2.443619      -3.371567
##        rab.sample.SV1 rab.sample.NP2 rab.sample.NP3 rab.sample.NP5   diff.btw
## 549322      -2.382425       1.721916      -1.320857     -0.3700944 -2.8959228
## 143239      -2.382425       1.721916      -1.320857     -0.3700944  0.9597222
## 255340      -2.382425       1.721916      -1.320857     -0.3700944  5.8325368
## 144887      -2.382425       1.721916      -1.320857     -0.3700944  1.1151942
## 141782      -2.382425       1.721916      -1.320857     -0.3700944  3.0308823
## 215972      -2.382425       1.721916      -1.320857     -0.3700944 -0.1282677
##        diff.win     effect   overlap     we.ep    we.eBH     wi.ep    wi.eBH
## 549322 4.376101 -0.5798190 0.2083343 0.3453243 0.5607102 0.4335938 0.6016757
## 143239 3.964454  0.2101402 0.4010419 0.5344543 0.7102889 0.6687500 0.7830871
## 255340 5.915275  0.8761477 0.2227988 0.3099057 0.5445720 0.4351562 0.6286235
## 144887 3.610967  0.2719419 0.3782386 0.4964911 0.6818479 0.6000000 0.7397385
## 141782 4.208694  0.6438661 0.2176175 0.4096025 0.6146385 0.4453125 0.6314579
## 215972 4.547052 -0.0185348 0.4870467 0.5328614 0.7096396 0.6367188 0.7662457
#Plot (these are built in plotting functions 
    aldex.plot(gp_soilocean_aldexmodel, type="MW", test="welch", xlab="Dispersion",ylab="Difference")

    aldex.plot(gp_soilocean_aldexmodel, type="MA", test="welch", xlab="Log-ratio abundance",ylab="Difference")

6.3.1 Custom Plotting of ALDEX2 results

######## CONVENIENCE PLOTTING FUNCTION 
    
  #Here's a function I made that will take your aldex model object, and the inout phyloseq object, and annotate the ASVs with their taxonomoy for plotting 
    ###Taxonomy Annotation Function 
    taxo_annotate<-function(aldex_obj,physeq_in){
      
      library(dplyr)
      
      ##Annotate Taxonomy  
      aldex_obj$ASV<-rownames(aldex_obj)
      physeq_tax<-data.frame(tax_table(physeq_in))
      physeq_tax$ASV<-rownames(physeq_tax)
      aldex_out<-left_join(aldex_obj,physeq_tax,"ASV")
      
      ##Annotate Significance
      aldex_out$signif<-factor(with(aldex_out,ifelse(wi.eBH < 0.01 & effect > 1 | wi.eBH < 0.01 & effect < -1,"Significant","NS")))
      
      return(aldex_out)
    }
    
    
  ### Run Function on Rnew Results
    gp_aldex_plotdata<-taxo_annotate(gp_soilocean_aldexmodel,gp_soilocean)
    
  ## How Many Sig. DIff ASVs?
    with(gp_aldex_plotdata,table(wi.eBH<0.05))
## 
## FALSE 
## 12852
  ###Plot
    library(ggplot2)
    rnew_plot<-ggplot(gp_aldex_plotdata,aes(x=diff.win,y=diff.btw)) + geom_point(aes(shape=signif,fill=Phylum)) + theme_bw() + geom_abline(intercept=0,slope=1,lty=2) + geom_abline(intercept=0,slope=-1,lty=2) + scale_shape_manual(values=c(21,22))
    rnew_plot + guides(fill = guide_legend(override.aes=list(shape=21,size=5)))

SECTION 7: FURTHER ANALYSIS

This guide has hopefully equipped you with a broad skillset of analytical tools for tackling microbiome data. Now we can apply those tools to new data to generate insights into the factors shaping host microbial communities.

7.1 Load Data

#Load Human Microbiome Project Data
  library(MicrobeDS)
  data("HMPv35")
  
#Inspect Phyloseq Object - 6000 samples
  HMPv35
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10730 taxa and 6000 samples ]
## sample_data() Sample Data:       [ 6000 samples by 33 sample variables ]
## tax_table()   Taxonomy Table:    [ 10730 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 10730 tips and 10729 internal nodes ]

7.2 Analysing Human Microbiome Project Data

The goal of this task is to generate report on variation in human microbiome diversity and structure based on body site. Use all of the tools present in this workflow.

This dataset contains samples from four major body sites. But remember in your analyses that these data are STRUCTURED. Multiple measuremenmts from the same individual, individuals arising from different areas including urban and rural. There are all things you might want to consider in your modelling.

table(sample_data(HMPv35)$env)
## 
##    Oral    Skin   Stool Vaginal 
##    3289    1881     388     442

ANALYSIS GOALS - 1) How do the body sites differ in richness? (alpha diversity metrics, models of alpha diversity) - 2) How do the body sites differ in structure (heatmaps, ordinations) - 3) How do the answers in (3) change if we use CLR-transform on the data instead? - 4) What ASV’s are differentially abundant between body sites

Note you may want to subset the data to a core set of ASV’s first to make analyses easier.

#END

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rbiom_1.0.2                 car_3.0-11                 
##  [3] carData_3.0-4               MuMIn_1.43.17              
##  [5] lme4_1.1-27.1               Matrix_1.3-4               
##  [7] MicrobeDS_0.1.0             forcats_0.5.1              
##  [9] stringr_1.4.0               dplyr_1.0.7                
## [11] purrr_0.3.4                 readr_2.0.1                
## [13] tidyr_1.1.3                 tibble_3.1.4               
## [15] tidyverse_1.3.1             pheatmap_1.0.12            
## [17] cowplot_1.1.1               gplots_3.1.1               
## [19] RColorBrewer_1.1-2          ALDEx2_1.24.0              
## [21] Rfast_2.0.3                 RcppZiggurat_0.1.6         
## [23] Rcpp_1.0.7                  zCompositions_1.3.4        
## [25] truncnorm_1.0-8             NADA_1.6-1.1               
## [27] survival_3.2-13             MASS_7.3-54                
## [29] labdsv_2.0-1                mgcv_1.8-37                
## [31] nlme_3.1-153                DESeq2_1.32.0              
## [33] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [35] MatrixGenerics_1.4.3        matrixStats_0.61.0         
## [37] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
## [39] IRanges_2.26.0              S4Vectors_0.30.0           
## [41] BiocGenerics_0.38.0         qiime2R_0.99.6             
## [43] microbiome_1.14.0           ggplot2_3.3.5              
## [45] vegan_2.5-7                 lattice_0.20-45            
## [47] permute_0.9-5               phyloseq_1.36.0            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.2.1        Hmisc_4.5-0           
##   [4] plyr_1.8.6             igraph_1.2.6           splines_4.1.1         
##   [7] BiocParallel_1.26.2    digest_0.6.28          foreach_1.5.1         
##  [10] htmltools_0.5.2        fansi_0.5.0            magrittr_2.0.1        
##  [13] checkmate_2.0.0        memoise_2.0.0          cluster_2.1.2         
##  [16] openxlsx_4.2.4         tzdb_0.1.2             Biostrings_2.60.2     
##  [19] annotate_1.70.0        modelr_0.1.8           RcppParallel_5.1.4    
##  [22] jpeg_0.1-9             colorspace_2.0-2       rvest_1.0.1           
##  [25] blob_1.2.2             haven_2.4.3            xfun_0.26             
##  [28] crayon_1.4.1           RCurl_1.98-1.5         jsonlite_1.7.2        
##  [31] genefilter_1.74.0      iterators_1.0.13       ape_5.5               
##  [34] glue_1.4.2             gtable_0.3.0           zlibbioc_1.38.0       
##  [37] XVector_0.32.0         DelayedArray_0.18.0    Rhdf5lib_1.14.2       
##  [40] abind_1.4-5            scales_1.1.1           DBI_1.1.1             
##  [43] xtable_1.8-4           htmlTable_2.2.1        foreign_0.8-81        
##  [46] bit_4.0.4              Formula_1.2-4          DT_0.19               
##  [49] htmlwidgets_1.5.4      httr_1.4.2             ellipsis_0.3.2        
##  [52] farver_2.1.0           pkgconfig_2.0.3        XML_3.99-0.8          
##  [55] dbplyr_2.1.1           nnet_7.3-16            sass_0.4.0            
##  [58] locfit_1.5-9.4         utf8_1.2.2             labeling_0.4.2        
##  [61] tidyselect_1.1.1       rlang_0.4.11           reshape2_1.4.4        
##  [64] AnnotationDbi_1.54.1   cellranger_1.1.0       munsell_0.5.0         
##  [67] tools_4.1.1            cachem_1.0.6           cli_3.0.1             
##  [70] generics_0.1.0         RSQLite_2.2.8          ade4_1.7-18           
##  [73] broom_0.7.9            evaluate_0.14          biomformat_1.20.0     
##  [76] fastmap_1.1.0          yaml_2.2.1             fs_1.5.0              
##  [79] knitr_1.34             bit64_4.0.5            zip_2.2.0             
##  [82] caTools_1.18.2         KEGGREST_1.32.0        slam_0.1-48           
##  [85] xml2_1.3.2             compiler_4.1.1         rstudioapi_0.13       
##  [88] curl_4.3.2             png_0.1-7              reprex_2.0.1          
##  [91] geneplotter_1.70.0     bslib_0.3.0            stringi_1.7.4         
##  [94] highr_0.9              nloptr_1.2.2.2         multtest_2.48.0       
##  [97] vctrs_0.3.8            pillar_1.6.3           lifecycle_1.0.1       
## [100] rhdf5filters_1.4.0     jquerylib_0.1.4        data.table_1.14.0     
## [103] bitops_1.0-7           R6_2.5.1               latticeExtra_0.6-29   
## [106] rio_0.5.27             KernSmooth_2.23-20     gridExtra_2.3         
## [109] codetools_0.2-18       boot_1.3-28            gtools_3.9.2          
## [112] assertthat_0.2.1       rhdf5_2.36.0           withr_2.4.2           
## [115] GenomeInfoDbData_1.2.6 hms_1.1.1              grid_4.1.1            
## [118] rpart_4.1-15           minqa_1.2.4            rmarkdown_2.11        
## [121] Rtsne_0.15             lubridate_1.7.10       base64enc_0.1-3